Pub Date: 2024-12-20. DOI: 10.1007/s40747-024-01705-8
Lei Chen, Tieyong Cao, Yunfei Zheng, Yang Wang, Bo Zhang, Jibin Yang
In knowledge distillation, a better teacher does not always produce a better student, owing to the capacity mismatch between the two models. This is especially true in pixel-level object segmentation, where some challenging pixels are difficult for the student model to learn. Even if the student model learns from the teacher model at every pixel, its performance still struggles to improve significantly. Mimicking the human learning process from easy to difficult, a dynamic dropout self-distillation method for object segmentation is proposed, which solves this problem by discarding the knowledge that the student struggles to learn. Firstly, the pixels where the teacher and student models differ significantly are identified according to their predicted probabilities and defined as difficult-to-learn pixels for the student model. Secondly, a dynamic dropout strategy is proposed to match the varying capability of the student model, discarding the pixels whose knowledge is too hard for the student. Finally, to validate the effectiveness of the proposed method, a simple student model for object segmentation and a virtual teacher model with perfect segmentation accuracy are constructed. Experimental results on four public datasets demonstrate that, when there is a large performance gap between the teacher and student models, the proposed self-distillation method improves the student model more effectively than other methods.
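The dropout idea above can be sketched in a few lines: compute the per-pixel disagreement between teacher and student probabilities, discard the most-disagreed fraction from the distillation loss, and shrink that fraction as training progresses. This is a minimal numpy sketch under assumed names and a linear drop schedule, not the paper's exact formulation.

```python
import numpy as np

def dropout_distillation_loss(student_prob, teacher_prob, drop_rate):
    """Distillation loss that discards the hardest pixels.

    Pixels where teacher and student predictions disagree most are
    treated as difficult-to-learn and excluded from the loss
    (illustrative sketch; not the paper's exact formulation).
    """
    diff = np.abs(teacher_prob - student_prob).ravel()
    n_keep = diff.size - int(drop_rate * diff.size)
    keep = np.argsort(diff)[:n_keep]  # indices of the easiest pixels
    t, s = teacher_prob.ravel()[keep], student_prob.ravel()[keep]
    return float(np.mean((t - s) ** 2))

def dynamic_drop_rate(epoch, total_epochs, start=0.3, end=0.0):
    """Linear schedule (assumed): drop fewer pixels as the student improves."""
    return start + (end - start) * epoch / max(total_epochs - 1, 1)
```

Discarding the hardest pixels lowers the loss the student is asked to match, which is the sense in which hard knowledge is "dropped" rather than forced onto a weaker model.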
Title: A dynamic dropout self-distillation method for object segmentation. Journal: Complex & Intelligent Systems (Journal Article).
Pub Date: 2024-12-19. DOI: 10.1007/s40747-024-01672-0
Zhenhai Wang, Lutao Yuan, Ying Ren, Sen Zhang, Hongyu Tian
The most common approach to visual object tracking feeds an image pair, comprising a template image and a search region, into a tracker, which uses a backbone to process the information in the pair. In pure Transformer-based frameworks, redundant information in the image pair persists throughout the tracking process, and the corresponding negative tokens consume the same computational resources as the positive tokens while degrading the tracker's performance. We therefore propose an adaptive dynamic sampling strategy in a pure Transformer-based tracker, known as ADSTrack. ADSTrack progressively removes irrelevant, redundant negative tokens in the search region that are unrelated to the tracked object, along with the noise these tokens generate. The strategy enhances the tracker's performance by scoring tokens and adaptively sampling the important ones, with the number of sampled tokens varying according to the input image. Moreover, the strategy is parameter-free, introducing no additional parameters. We also add several auxiliary tokens to the backbone to further optimize the feature map. We extensively evaluate ADSTrack, achieving satisfactory results on seven test sets, including UAV123 and LaSOT.
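The scoring-and-sampling step can be illustrated as follows: rank search-region tokens by a relevance score and keep only as many as needed, so the kept count adapts to each image. A minimal numpy sketch; the score-mass threshold and the scoring rule itself are assumptions, not ADSTrack's exact procedure.

```python
import numpy as np

def adaptive_token_sampling(tokens, scores, keep_mass=0.85):
    """Parameter-free pruning sketch: keep the highest-scoring tokens
    until `keep_mass` of the total score is covered, so the number of
    kept tokens varies with the input (the threshold is an assumption).
    """
    order = np.argsort(scores)[::-1]                 # most relevant first
    cum = np.cumsum(scores[order]) / scores.sum()
    n_keep = int(np.searchsorted(cum, keep_mass)) + 1
    kept = np.sort(order[:n_keep])                   # restore original order
    return tokens[kept], kept
```

Because the cut is driven by the score distribution rather than a fixed count, a frame with many background tokens prunes aggressively while a cluttered frame keeps more, with no learned parameters involved.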
Title: ADSTrack: adaptive dynamic sampling for visual tracking
Given the remarkable text generation capabilities of pre-trained language models, impressive results have been achieved in graph-to-text generation. However, when learning from knowledge graphs, these language models are unable to fully grasp the structural information of the graph, leading to logical errors and missing key information. An important research direction is therefore to minimize the loss of graph structural information during model training. We propose a framework named Edge-Optimized Multi-Level Information Refinement (EMLR), which aims to maximize the retention of the graph's structural information from an edge perspective. Based on this framework, we further propose a new graph generation model, named TriELMR, highlighting the comprehensive interactive learning relationship between the model and the graph structure, as well as the importance of edges in the graph structure. TriELMR adopts three main strategies to reduce information loss during learning: (1) knowledge sequence optimization; (2) the EMLR framework; and (3) a graph activation function. Experimental results reveal that TriELMR exhibits exceptional performance across various benchmark tests, especially on the WebNLG v2.0 and EventNarrative datasets, achieving BLEU-4 scores of 66.5% and 37.27%, respectively, surpassing state-of-the-art models. These results demonstrate the advantages of TriELMR in maintaining the accuracy of graph structural information.
Title: Edge-centric optimization: a novel strategy for minimizing information loss in graph-to-text generation
Authors: Zheng Yao, Jingyuan Li, Jianhe Cen, Shiqi Sun, Dahu Yin, Yuanzhuo Wang
DOI: 10.1007/s40747-024-01690-y
Pub Date: 2024-12-19. DOI: 10.1007/s40747-024-01683-x
Shakir Bilal, Wajdi Zaatour, Yilian Alonso Otano, Arindam Saha, Ken Newcomb, Soo Kim, Jun Kim, Raveena Ginjala, Derek Groen, Edwin Michael
The COVID-19 pandemic has dramatically highlighted the importance of developing simulation systems for quickly characterizing and providing spatio-temporal forecasts of infection spread dynamics that take specific account of the population and spatial heterogeneities governing pathogen transmission in real-world communities. Developing such computational systems must also overcome the cold-start problem posed by the inevitably scarce early data and limited knowledge regarding a novel pathogen's transmissibility and virulence, while addressing changing population behavior and policy options as a pandemic evolves. Here, we describe how we have coupled advances in the construction of digital or virtual models of real-world cities with an agile, modular, agent-based model of viral transmission and with data from navigation and social media interactions to overcome these challenges and provide a new simulation tool, CitySEIRCast, that can model viral spread at the sub-national level. Our data pipelines and workflows are designed to be flexible and scalable, so that we can implement the system on hybrid cloud/cluster systems and remain agile enough to address different population settings and, indeed, different diseases. Our simulation results demonstrate that CitySEIRCast can provide the timely, high-resolution spatio-temporal epidemic predictions required for supporting situational awareness of the state of a pandemic, as well as for assessing vulnerable sub-populations and locations and evaluating the impacts of implemented interventions, inclusive of the effects of population behavioral responses to fluctuations in case incidence. This work arose in response to requests from county agencies to support their work on COVID-19 monitoring, risk assessment, and planning; using the described workflows, we were able to provide uninterrupted bi-weekly simulations to guide their efforts for over a year, from late 2021 to 2023.
We discuss future work that can significantly improve the scalability and real-time application of this digital city-based epidemic modelling system, such that validated predictions and forecasts of the paths that may be followed by a contagion, over both time and space, can be used to anticipate the spread dynamics, at-risk groups and regions, and options for responding effectively to a complex epidemic.
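The agent-based transmission loop at the core of such a system can be sketched as a toy SEIR update over a synthetic population. All names, rates, and the random-contact rule below are illustrative assumptions, not CitySEIRCast's calibrated model.

```python
import random

# Minimal agent-based SEIR step: a toy version of the kind of
# transmission loop an agent-based city model runs each tick.
S, E, I, R = "S", "E", "I", "R"

def step(agents, contacts_per_agent=5, p_transmit=0.1,
         p_incubate=0.2, p_recover=0.1, rng=None):
    rng = rng or random.Random(0)
    n = len(agents)
    new = list(agents)
    # Infectious agents expose random contacts (a stand-in for the
    # contact networks built from navigation and social-media data).
    for i, state in enumerate(agents):
        if state != I:
            continue
        for _ in range(contacts_per_agent):
            j = rng.randrange(n)
            if agents[j] == S and rng.random() < p_transmit:
                new[j] = E
    # Disease progression, evaluated on the pre-step states: E -> I -> R.
    for i, state in enumerate(agents):
        if state == E and rng.random() < p_incubate:
            new[i] = I
        elif state == I and rng.random() < p_recover:
            new[i] = R
    return new
```

A real city digital twin replaces the uniform random contacts with mobility- and venue-structured mixing, which is where the population and spatial heterogeneities enter.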
Title: CitySEIRCast: an agent-based city digital twin for pandemic analysis and simulation
Pub Date: 2024-12-19. DOI: 10.1007/s40747-024-01695-7
Yunfan Zhang, Rong Zou, Yiqun Zhang, Yue Zhang, Yiu-ming Cheung, Kangshun Li
Heterogeneous attribute data (also called mixed data), characterized by attributes with numerical and categorical values, occur frequently across various scenarios. Since annotation costs are high, clustering has emerged as a favorable technique for analyzing unlabeled mixed data. To address complex real-world clustering tasks, this paper proposes a new clustering method called Adaptive Micro Partition and Hierarchical Merging (AMPHM), based on neighborhood rough set theory and a novel hierarchical merging mechanism. Specifically, we present a distance metric that unifies numerical and categorical attributes, and we leverage neighborhood rough sets to partition data objects into fine-grained compact clusters. Then, we gradually merge the currently most similar clusters, avoiding the incorporation of dissimilar objects into a cluster. The proposed approach breaks through the clustering performance bottleneck caused by a pre-set number of sought clusters k and by cluster distribution bias, and is thus capable of clustering datasets comprising various combinations of numerical and categorical attributes. Extensive experimental evaluations comparing AMPHM with state-of-the-art counterparts on various datasets demonstrate its superiority.
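A unified mixed-attribute distance of the general kind described above can be sketched as a range-normalized absolute difference for numerical attributes plus a 0/1 mismatch for categorical ones, averaged over attributes. This is an illustrative stand-in, not AMPHM's exact metric.

```python
def mixed_distance(x, y, is_categorical, num_ranges):
    """Unified distance over mixed attributes (illustrative sketch,
    not AMPHM's exact metric): range-normalized absolute difference
    for numerical attributes, 0/1 mismatch for categorical ones."""
    total = 0.0
    for xi, yi, cat, rng in zip(x, y, is_categorical, num_ranges):
        if cat:
            total += 0.0 if xi == yi else 1.0          # categorical: mismatch
        else:
            total += abs(float(xi) - float(yi)) / rng  # numerical: normalized
    return total / len(x)
```

Normalizing each numerical attribute by its observed range keeps both attribute types on a comparable [0, 1] scale, which is what makes a single metric usable for partitioning mixed data.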
Title: Adaptive micro partition and hierarchical merging for accurate mixed data clustering
Low-light object detection involves identifying and locating objects in images captured under poor lighting conditions. It plays a significant role in surveillance and security, night pedestrian recognition, and autonomous driving, showcasing broad application prospects. Most existing object detection algorithms and datasets are designed for normal lighting conditions, leading to a significant drop in detection performance when applied to low-light environments. To address this issue, we propose a Low-Light Detection with Low-Light Enhancement (LDWLE) framework. LDWLE is an encoder-decoder architecture where the encoder transforms the raw input data into a compact, abstract representation (encoding), and the decoder gradually generates the target output format from the representation produced by the encoder. Specifically, during training, low-light images are input into the encoder, which produces feature representations that are decoded by two separate decoders: an object detection decoder and a low-light image enhancement decoder. Both decoders share the same encoder and are trained jointly. Throughout the training process, the two decoders optimize each other, guiding the low-light image enhancement towards improvements that benefit object detection. If the input image is normally lit, it first passes through a low-light image conversion module to be transformed into a low-light image before being fed into the encoder. If the input image is already a low-light image, it is directly input into the encoder. During the testing phase, the model can be evaluated in the same way as a standard object detection algorithm. Compared to existing object detection algorithms, LDWLE can train a low-light robust object detection model using standard, normally lit object detection datasets. Additionally, LDWLE is a versatile training framework that can be implemented on most one-stage object detection algorithms. 
These algorithms typically consist of three components: the backbone, neck, and head. In this framework, the backbone functions as the encoder, while the neck and head form the object detection decoder. Extensive experiments on the COCO, VOC, and ExDark datasets demonstrate the effectiveness of LDWLE in low-light object detection. Quantitatively, it achieves APs of 25.5 and 38.4 on the synthetic datasets COCO-d and VOC-d, respectively, and the best AP of 30.5 on the real-world dataset ExDark. Qualitatively, LDWLE accurately detects most objects on both public real-world low-light datasets and self-collected ones, demonstrating strong adaptability to varying lighting conditions and multi-scale objects.
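Two pieces of the training setup above lend themselves to a short sketch: the low-light conversion applied to normally lit inputs, and the joint objective in which the detection and enhancement decoders share one encoder. A numpy sketch under assumed gamma/scale values and loss weighting, not LDWLE's exact modules.

```python
import numpy as np

def synth_low_light(img, gamma=3.0, scale=0.4):
    """Darken a normally lit image in [0, 1] (stand-in for the low-light
    conversion module; gamma and scale are illustrative assumptions)."""
    return np.clip(scale * img ** gamma, 0.0, 1.0)

def joint_loss(det_out, det_target, enh_out, normal_img, w=0.5):
    """Shared-encoder objective sketch: detection loss plus a weighted
    enhancement loss, so both decoders shape the encoder's features."""
    det = np.mean((det_out - det_target) ** 2)    # detection decoder term
    enh = np.mean(np.abs(enh_out - normal_img))   # enhancement decoder term
    return float(det + w * enh)
```

Because both terms backpropagate through the same encoder, improvements driven by the enhancement loss are steered toward features that also help detection, which is the mutual-optimization effect the abstract describes.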
Title: LDWLE: self-supervised driven low-light object detection framework
Authors: Xiaoyang Shen, Haibin Li, Yaqian Li, Wenming Zhang
DOI: 10.1007/s40747-024-01681-z
Pub Date: 2024-12-19. DOI: 10.1007/s40747-024-01691-x
Teng Fei, Ligong Bi, Jieming Gao, Shuixuan Chen, Guowei Zhang
With the advent of 3D Gaussian Splatting (3DGS), new and effective solutions have emerged for 3D reconstruction pipelines and scene representation. However, achieving high-fidelity reconstruction of complex scenes and capturing low-frequency features remain long-standing challenges in the field of visual 3D reconstruction. Relying solely on sparse point inputs and simple optimization criteria often leads to non-robust reconstructions of the radiance field, with reconstruction quality heavily dependent on the proper initialization of inputs. Notably, Multi-View Stereo (MVS) techniques offer a mature and reliable approach for generating structured point cloud data using a limited number of views, camera parameters, and feature matching. In this paper, we propose combining MVS with Gaussian Splatting, along with our newly introduced density optimization strategy, to address these challenges. This approach bridges the gap in scene representation by enhancing explicit geometry radiance fields with MVS, and our experimental results demonstrate its effectiveness. Additionally, we have explored the potential of using Gaussian Splatting for non-face template single-process end-to-end Avatar Reconstruction, yielding promising experimental results.
Title: MVSGS: Gaussian splatting radiation field enhancement using multi-view stereo
Pub Date: 2024-12-19. DOI: 10.1007/s40747-024-01641-7
Hui Lv, Chi Pu, La Duo, Yan Li, Qingguo Zhou, Jun Shen
The advent of ChatGPT and GPT-4 has generated substantial interest in large language model (LLM) research, with these models showcasing remarkable performance in applications such as conversation systems, machine translation, and research paper summarization. However, their efficacy diminishes when applied to low-resource languages, particularly in academic research contexts like Tibetan. In this study, we trained Tibetan LLaMA (T-LLaMA), a model based on efficient pre-training technology, for three downstream tasks: text classification, news text generation, and automatic text summarization. To address the lack of corpora, we constructed a Tibetan dataset comprising 2.2 billion characters. Furthermore, we augmented the vocabulary of LLaMA2 from Meta AI by expanding the Tibetan vocabulary using SentencePiece. Notably, the text classification task attains a state-of-the-art (SOTA) accuracy of 79.8% on the publicly available Tibetan News Classification Corpus. In addition, manual review of 500 generated samples indicates satisfactory results in both the news text generation and text summarization tasks. To our knowledge, T-LLaMA is the first large-scale language model in Tibetan natural language processing (NLP) with parameters in the billion range. We openly provide our trained models, anticipating that this contribution not only fills gaps in the Tibetan large-scale language model domain but also serves as a foundational model for researchers with limited computational resources in the Tibetan NLP community.
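The vocabulary-expansion step can be sketched as merging Tibetan SentencePiece tokens into the base LLaMA2 vocabulary while skipping duplicates; in practice one would load both tokenizer models with the sentencepiece library and resize the embedding matrix to the merged size. A pure-Python sketch with hypothetical example tokens.

```python
def merge_vocab(base_vocab, new_tokens):
    """Append new-language tokens to the base vocabulary, skipping any
    that already exist, preserving original token order and IDs.
    (Sketch only: real code would also resize the model's embedding
    matrix to len(merged) and initialize the new rows.)"""
    merged = list(base_vocab)
    seen = set(base_vocab)
    for tok in new_tokens:
        if tok not in seen:
            merged.append(tok)
            seen.add(tok)
    return merged
```

Keeping the base tokens at their original indices means the pretrained embeddings remain valid, and only the appended Tibetan rows need fresh initialization before continued pre-training.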
{"title":"T-LLaMA: a Tibetan large language model based on LLaMA2","authors":"Hui Lv, Chi Pu, La Duo, Yan Li, Qingguo Zhou, Jun Shen","doi":"10.1007/s40747-024-01641-7","DOIUrl":"https://doi.org/10.1007/s40747-024-01641-7","url":null,"abstract":"<p>The advent of ChatGPT and GPT-4 has generated substantial interest in large language model (LLM) research, showcasing remarkable performance in various applications such as conversation systems, machine translation, and research paper summarization. However, their efficacy diminishes when applied to low-resource languages, particularly in academic research contexts such as Tibetan. In this study, we trained Tibetan LLaMA (T-LLaMA), a model based on efficient pre-training technology, for three downstream tasks: text classification, news text generation, and automatic text summarization. To address the lack of a corpus, we constructed a Tibetan dataset comprising 2.2 billion characters. Furthermore, we augmented the vocabulary of LLaMA2 from META AI by expanding the Tibetan vocabulary using SentencePiece. Notably, the text classification task attains a state-of-the-art (SOTA) accuracy of 79.8% on the publicly available Tibetan News Classification Corpus. In addition, manual review of 500 generated samples indicates satisfactory results in both the news text generation and text summarization tasks. To our knowledge, T-LLaMA stands as the first large-scale language model in Tibetan natural language processing (NLP) with parameters in the billion range. We openly provide our trained models, anticipating that this contribution not only fills a gap in the Tibetan large language model domain but also provides foundational models for researchers with limited computational resources in the Tibetan NLP community. 
The T-LLaMA model is available at https://huggingface.co/Pagewood/T-LLaMA.</p>","PeriodicalId":10524,"journal":{"name":"Complex & Intelligent Systems","volume":"10 1","pages":""},"PeriodicalIF":5.8,"publicationDate":"2024-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142849193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
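The vocabulary expansion the T-LLaMA abstract describes (adding Tibetan SentencePiece pieces to the LLaMA2 vocabulary) reduces to one rule: append pieces the base tokenizer does not already have, while keeping every base token ID fixed so pretrained embeddings stay valid. A minimal sketch of that merge rule, with hypothetical vocabularies; the actual pipeline operates on SentencePiece model protos, which is not shown here:

```python
def extend_vocab(base_vocab, new_pieces):
    """Append unseen pieces to a base vocabulary, preserving base token IDs.

    base_vocab: list of tokenizer pieces in ID order (ID = list index).
    new_pieces: pieces from a tokenizer trained on the new-language corpus.
    Returns the merged piece list; new pieces receive fresh trailing IDs.
    """
    merged = list(base_vocab)
    seen = set(base_vocab)
    for piece in new_pieces:
        if piece not in seen:  # skip pieces the base model already covers
            merged.append(piece)
            seen.add(piece)
    return merged
```

After a merge like this, the model's embedding matrix would be resized to `len(merged)` rows, with the new rows initialized and then trained on the new-language corpus.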
Pub Date : 2024-12-19DOI: 10.1007/s40747-024-01687-7
Quan Wang, Guangfei Ye, Qidong Chen, Songyang Zhang, Fengqing Wang
Vehicle detection and tracking from a UAV perspective often suffer from missed and false detections due to small targets, complex scenes, and target occlusion, which ultimately degrades detection accuracy and tracking stability. Additionally, current models have large parameter counts, which makes them hard to deploy on mobile devices. Therefore, this paper proposes a target detection and tracking algorithm based on YOLO-LMP and NGCTrack to address these issues. Firstly, detection of small targets in occluded scenes is enhanced by adding a MODConv to the small-target detection head and increasing its size. In addition, excessive deletion of prediction boxes is prevented by using the LSKAttention mechanism to adaptively adjust the receptive field at the downsampling stage and combining it with a Soft-NMS strategy. Furthermore, the C2f module is replaced by the FPW to reduce redundant computation and memory usage. At the tracking stage, NGCTrack replaces IoU with GIoU and employs a modified NSA Kalman filter that adjusts the state-space aspect ratio for width prediction. Finally, a camera adjustment mechanism is introduced to improve tracking precision and consistency. Experimental results show that, compared to YOLOv8, the YOLO-LMP model improves mAP50 and mAP50:95 by 10.3% and 12.2%, respectively, while reducing the number of parameters by 47.7%. When combined with the improved NGCTrack, identity switches (IDSW) are reduced by 73.6% compared to the ByteTrack method, while MOTA and IDF1 increase by 5.2% and 9.8%, respectively.
{"title":"A UAV perspective based lightweight target detection and tracking algorithm for intelligent transportation","authors":"Quan Wang, Guangfei Ye, Qidong Chen, Songyang Zhang, Fengqing Wang","doi":"10.1007/s40747-024-01687-7","DOIUrl":"https://doi.org/10.1007/s40747-024-01687-7","url":null,"abstract":"<p>Vehicle detection and tracking from a UAV perspective often suffer from missed and false detections due to small targets, complex scenes, and target occlusion, which ultimately degrades detection accuracy and tracking stability. Additionally, current models have large parameter counts, which makes them hard to deploy on mobile devices. Therefore, this paper proposes a target detection and tracking algorithm based on YOLO-LMP and NGCTrack to address these issues. Firstly, detection of small targets in occluded scenes is enhanced by adding a MODConv to the small-target detection head and increasing its size. In addition, excessive deletion of prediction boxes is prevented by using the LSKAttention mechanism to adaptively adjust the receptive field at the downsampling stage and combining it with a Soft-NMS strategy. Furthermore, the C2f module is replaced by the FPW to reduce redundant computation and memory usage. At the tracking stage, NGCTrack replaces IoU with GIoU and employs a modified NSA Kalman filter that adjusts the state-space aspect ratio for width prediction. Finally, a camera adjustment mechanism is introduced to improve tracking precision and consistency. Experimental results show that, compared to YOLOv8, the YOLO-LMP model improves mAP50 and mAP50:95 by 10.3% and 12.2%, respectively, while reducing the number of parameters by 47.7%. 
When combined with the improved NGCTrack, identity switches (IDSW) are reduced by 73.6% compared to the ByteTrack method, while MOTA and IDF1 increase by 5.2% and 9.8%, respectively.</p>","PeriodicalId":10524,"journal":{"name":"Complex & Intelligent Systems","volume":"54 1","pages":""},"PeriodicalIF":5.8,"publicationDate":"2024-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142849195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
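The Soft-NMS strategy named in the abstract above avoids the excessive deletion of prediction boxes by decaying the scores of overlapping boxes rather than removing them outright. A minimal pure-Python sketch of the standard Gaussian variant (the paper's exact formulation and parameters are not given in the abstract, so `sigma` and `score_thresh` here are illustrative defaults):

```python
import math

def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay overlapping scores instead of hard suppression.

    Returns indices of surviving boxes in descending (decayed) score order.
    """
    remaining = list(range(len(boxes)))
    scores = list(scores)  # copy; decayed in place below
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        keep.append(best)
        remaining.remove(best)
        for i in remaining:
            ov = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(ov * ov) / sigma)  # Gaussian decay by overlap
        # drop only boxes whose score has decayed below the floor
        remaining = [i for i in remaining if scores[i] > score_thresh]
    return keep
```

Unlike hard NMS, a heavily overlapped box survives with a reduced score, so a genuinely distinct but occluding vehicle is not discarded outright.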
Pub Date : 2024-12-19DOI: 10.1007/s40747-024-01693-9
Yanzhan Chen, Qian Zhang, Fan Yu
The daily occurrence of traffic accidents has led to the development of 3D reconstruction as a key tool for accident reconstruction, investigation, and insurance claims. This study proposes a novel virtual-real-fusion simulation framework that integrates traffic accident generation, unmanned aerial vehicle (UAV)-based image collection, and a 3D traffic accident reconstruction pipeline with advanced computer vision techniques and unsupervised 3D point cloud clustering algorithms. Specifically, a micro-traffic simulator and an autonomous driving simulator are co-simulated to generate high-fidelity traffic accidents. Subsequently, a deep learning-based reconstruction method, i.e., 3D Gaussian splatting (3D-GS), is utilized to construct 3D digitized traffic accident scenes from UAV-based image datasets collected in the traffic simulation environment. While visual rendering by 3D-GS struggles under adverse conditions such as nighttime or rain, a clustering parameter stochastic optimization model and a mixed-integer programming Bayesian optimization (MIPBO) algorithm are proposed to enhance the segmentation of large-scale 3D point clouds. In the numerical experiments, 3D-GS produces high-quality, seamless, real-time rendered traffic accident scenes, achieving a structural similarity index measure of up to 0.90 across different towns. Furthermore, the proposed MIPBO algorithm exhibits a remarkably fast convergence rate, requiring only 3–5 iterations to identify well-performing parameters and achieving a high R² value of 0.8 on a benchmark clustering problem. Finally, the Gaussian Mixture Model assisted by MIPBO accurately separates the various traffic elements in the accident scenes, demonstrating higher effectiveness than other classical clustering algorithms.
{"title":"Transforming traffic accident investigations: a virtual-real-fusion framework for intelligent 3D traffic accident reconstruction","authors":"Yanzhan Chen, Qian Zhang, Fan Yu","doi":"10.1007/s40747-024-01693-9","DOIUrl":"https://doi.org/10.1007/s40747-024-01693-9","url":null,"abstract":"<p>The daily occurrence of traffic accidents has led to the development of 3D reconstruction as a key tool for accident reconstruction, investigation, and insurance claims. This study proposes a novel virtual-real-fusion simulation framework that integrates traffic accident generation, unmanned aerial vehicle (UAV)-based image collection, and a 3D traffic accident reconstruction pipeline with advanced computer vision techniques and unsupervised 3D point cloud clustering algorithms. Specifically, a micro-traffic simulator and an autonomous driving simulator are co-simulated to generate high-fidelity traffic accidents. Subsequently, a deep learning-based reconstruction method, i.e., 3D Gaussian splatting (3D-GS), is utilized to construct 3D digitized traffic accident scenes from UAV-based image datasets collected in the traffic simulation environment. While visual rendering by 3D-GS struggles under adverse conditions such as nighttime or rain, a clustering parameter stochastic optimization model and a mixed-integer programming Bayesian optimization (MIPBO) algorithm are proposed to enhance the segmentation of large-scale 3D point clouds. In the numerical experiments, 3D-GS produces high-quality, seamless, real-time rendered traffic accident scenes, achieving a structural similarity index measure of up to 0.90 across different towns. Furthermore, the proposed MIPBO algorithm exhibits a remarkably fast convergence rate, requiring only 3–5 iterations to identify well-performing parameters and achieving a high <span>({R}^{2})</span> value of 0.8 on a benchmark clustering problem. 
Finally, the Gaussian Mixture Model assisted by MIPBO accurately separates various traffic elements in the accident scenes, demonstrating higher effectiveness compared to other classical clustering algorithms.</p>","PeriodicalId":10524,"journal":{"name":"Complex & Intelligent Systems","volume":"27 1","pages":""},"PeriodicalIF":5.8,"publicationDate":"2024-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142848862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
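The Gaussian Mixture Model in the abstract above is conventionally fit by expectation-maximization (EM). A minimal one-dimensional, two-component EM sketch on hypothetical data; the paper clusters 3D point clouds and tunes the clustering parameters via MIPBO, neither of which is shown here:

```python
import math

def gmm_em_1d(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Returns (weights, means, stds). Means are initialized at the data
    extremes so the run is deterministic.
    """
    k = 2
    mus = [min(xs), max(xs)]
    sigmas = [1.0] * k
    pis = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        # (the 1/sqrt(2*pi) normalization cancels in the ratio, so it is omitted)
        resp = []
        for x in xs:
            ps = [pis[j] / sigmas[j] * math.exp(-0.5 * ((x - mus[j]) / sigmas[j]) ** 2)
                  for j in range(k)]
            total = sum(ps)
            resp.append([p / total for p in ps])
        # M-step: re-estimate weights, means, and stds from responsibilities
        for j in range(k):
            nj = sum(r[j] for r in resp)
            pis[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor to avoid collapse
    return pis, mus, sigmas
```

On well-separated data the responsibilities become nearly hard assignments and each component's mean converges to its cluster's centroid; separating multiple element types in a 3D scene is the same idea with vector-valued means and covariance matrices.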