
Latest Publications in Computational Intelligence

An Automated Recommendation System for Crowdsourcing Data Using Improved Heuristic-Aided Residual Long Short-Term Memory
IF 1.8 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-01-06 | DOI: 10.1111/coin.70017
K. Dhinakaran, R. Nedunchelian

In recent years, crowdsourcing has developed into a business production paradigm and a distributed problem-solving platform. However, conventional machine learning models fail to help requesters and workers find well-matched jobs, which degrades output quality. Large-scale crowdsourcing systems typically involve many microtasks, so a crowdworker needs considerable time to search for suitable work, which makes task recommendation methods valuable. Yet traditional approaches do not address the cold-start problem. To tackle these issues, this paper implements a new deep-learning-based recommendation system for crowdsourcing data. Initially, crowdsourced data are collected from standard online sources. The novelty of the model is an adaptive residual long short-term memory (ARes-LSTM) network that learns a task's latent factors from its features rather than its task ID. The network's parameters are optimized by the fitness-based drawer algorithm (F-DA) to improve efficacy. The ARes-LSTM then estimates each user's preference score from the user's historical behavior. Based on users' historical behavior records and task features, it provides personalized task recommendations and mitigates the cold-start problem. In the experiments, the implemented model attains an accuracy of 91.43%, compared with 84.07%, 85.42%, 87.07%, and 90.07% for the traditional AOA, TSA, BBRO, and DA techniques, respectively. Finally, the implemented recommendation system is compared against various traditional techniques on standard efficiency metrics to demonstrate its superiority. The results show that the developed system selects tasks matched to individual preferences, which can broaden users' opportunities to engage in crowdsourcing efforts across a wide range of platforms.
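The cold-start advantage of scoring tasks by their features rather than by task ID can be illustrated with a minimal sketch. This is not the paper's ARes-LSTM; it is a hypothetical cosine-similarity scorer over made-up 4-dimensional task features, standing in for the learned preference model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 historical tasks a worker completed, each described
# by a 4-dimensional feature vector (e.g., category, reward, duration, difficulty).
history = rng.random((5, 4))

# A simple preference profile: the mean of the feature vectors of previously
# chosen tasks. ARes-LSTM learns a far richer sequence model, but the key idea
# is the same: preferences live in feature space, not ID space.
profile = history.mean(axis=0)

def preference_score(task_features: np.ndarray, profile: np.ndarray) -> float:
    """Cosine similarity between a task's features and the worker profile."""
    num = float(task_features @ profile)
    den = float(np.linalg.norm(task_features) * np.linalg.norm(profile))
    return num / den if den > 0 else 0.0

# A brand-new task (with a never-seen ID) can still be scored, because the
# score depends only on its features -- this is how feature-based models
# sidestep the cold-start problem that ID-based lookups cannot handle.
new_task = rng.random(4)
score = preference_score(new_task, profile)
```

Ranking candidate tasks by this score and recommending the top few is the simplest version of the personalized recommendation step described above.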

Citations: 0
Real-Time Single Channel Speech Enhancement Using Triple Attention and Stacked Squeeze-TCN
IF 1.8 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-01-06 | DOI: 10.1111/coin.70016
Chaitanya Jannu, Manaswini Burra, Sunny Dayal Vanambathina, Veeraswamy Parisae

Speech enhancement is crucial in many speech processing applications. Recently, researchers have explored ways to improve performance by capturing long-term contextual relationships within speech signals. Multi-stage learning, in which several deep learning modules are activated one after another, has proven effective. Attention mechanisms have also been shown to improve speech quality significantly, and attention modules have been developed to strengthen CNN backbone networks. However, these modules often rely on fully connected (FC) and convolution layers, which inflate the model's parameter count and computational cost. The present study applies multi-stage learning to speech enhancement: at each stage, a triple attention block (TAB) is followed by a sequence of squeeze temporal convolutional modules (STCMs) whose dilation rates double from layer to layer. An estimate is generated at each stage and refined in the next; a feature fusion module (FFM) at the start of each subsequent stage reintroduces the original information. By repeatedly unfolding STCMs, the intermediate output is improved step by step, eventually yielding a precise spectrum estimate. The TAB lets the model concurrently attend to regions of interest along the channel, spatial, and time-frequency dimensions. Specifically, its channel-spatial attention (CSA) component has two parallel branches that combine channel and spatial attention, so both dimensions are captured simultaneously; the signal is then emphasized over time and frequency by aggregating feature maps along those dimensions. This improves the model's ability to capture the temporal dependencies of speech signals. On the VCTK and LibriSpeech datasets, the proposed speech enhancement system is evaluated against state-of-the-art deep learning techniques and achieves better results in terms of PESQ, STOI, CSIG, CBAK, and COVL.
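The "dilation rates that double from layer to layer" idea behind the STCM stack can be sketched independently of the paper's architecture. The toy 1-D causal convolution below (hypothetical kernel and signal, not the authors' code) traces how four layers with dilations 1, 2, 4, 8 expand an impulse's receptive field:

```python
import numpy as np

def dilated_causal_conv(x: np.ndarray, kernel: np.ndarray, dilation: int) -> np.ndarray:
    """1-D causal convolution with the given dilation (zero-padded on the left)."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    out = np.zeros_like(x)
    for t in range(len(x)):
        for i in range(k):
            out[t] += kernel[i] * xp[pad + t - i * dilation]
    return out

# Stacking layers whose dilation rates double (1, 2, 4, 8) grows the receptive
# field exponentially: with kernel size k and L layers it reaches
# 1 + (k - 1) * (2**L - 1) samples -- the mechanism dilated TCN-style blocks
# use to capture long-range temporal context cheaply.
kernel = np.array([0.5, 0.5])
x = np.zeros(32)
x[0] = 1.0  # unit impulse, to trace the receptive field
y = x
for d in (1, 2, 4, 8):
    y = dilated_causal_conv(y, kernel, d)
# With k = 2 and L = 4, the impulse response spans 1 + 1 * 15 = 16 samples.
```

Each doubling lets one extra layer see twice as far back in time, which is why a short stack suffices for long speech-frame contexts.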

Citations: 0
Multimodal Integration of Mel Spectrograms and Text Transcripts for Enhanced Automatic Speech Recognition: Leveraging Extractive Transformer-Based Approaches and Late Fusion Strategies
IF 1.8 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-19 | DOI: 10.1111/coin.70012
Sunakshi Mehra, Virender Ranga, Ritu Agarwal

This research aims to advance the field of Automatic Speech Recognition (ASR) by integrating multimodal data, specifically textual transcripts and Mel spectrograms (2D images) derived from raw audio. The study explores the under-examined potential of spectrograms and linguistic information for improving spoken-word recognition accuracy. To elevate ASR performance, we propose two distinct transformer-based approaches. First, for the audio-centric approach, we leverage RegNet and ConvNeXt architectures, pre-trained on a massive dataset of 14 million annotated ImageNet images, to process Mel spectrograms as image inputs. Second, we harness the Speech2Text transformer to obtain text transcripts directly from the raw audio. We pre-process the Mel spectrogram images, resizing them to 224 × 224 pixels to create two-dimensional audio representations, which the ImageNet-pre-trained RegNet and ConvNeXt models categorize individually. This first channel generates the visual-modality embeddings (RegNet and ConvNeXt) on the 2D Mel spectrograms. Additionally, we employ Sentence-BERT embeddings, via Siamese BERT networks, to transform the Speech2Text transcripts into vectors. The image embeddings, along with the Sentence-BERT embeddings from the speech transcripts, are subsequently fine-tuned within a five-layer deep dense model with batch normalization for spoken-word classification. Our experiments focus on the Google Speech Commands Dataset (GSCD) version 2, which comprises 35 word categories. To gauge the impact of spectrograms and linguistic features, we conducted an ablation analysis. Our novel late fusion strategy unites word embeddings and image embeddings, yielding test accuracies of 95.87% for ConvNeXt, 99.95% for RegNet, and 85.93% for text transcripts across the 35 word categories when processed by the deep dense model with batch normalization. After late fusion of ConvNeXt + RegNet + SBERT, we obtained a test accuracy of 99.96% on the 35 word categories, surpassing other state-of-the-art methods.
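Late fusion of this kind is typically a weighted average of per-model class posteriors followed by an argmax. A minimal sketch, with made-up 4-class probability vectors standing in for the ConvNeXt, RegNet, and SBERT outputs (the paper's exact fusion rule may differ):

```python
import numpy as np

def late_fusion(prob_sets, weights=None):
    """Average (optionally weighted) class-probability vectors from several models."""
    probs = np.stack(prob_sets)  # shape (n_models, n_classes)
    w = np.ones(len(prob_sets)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    return (w[:, None] * probs).sum(axis=0)

# Hypothetical 4-class posteriors from three models (placeholders, not the
# paper's actual outputs).
convnext = np.array([0.70, 0.10, 0.10, 0.10])
regnet   = np.array([0.20, 0.60, 0.10, 0.10])
sbert    = np.array([0.60, 0.20, 0.10, 0.10])

fused = late_fusion([convnext, regnet, sbert])
prediction = int(np.argmax(fused))  # class 0: two of the three branches favour it
```

Because fusion happens on final posteriors, each branch can be trained and swapped independently, which is the main practical appeal of late fusion over joint training.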

Citations: 0
SHREA: A Systematic Hybrid Resampling Ensemble Approach Using One Class Classifier
IF 1.8 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-19 | DOI: 10.1111/coin.70004
Pranita Baro, Malaya Dutta Borah

Imbalanced classification and data incompleteness are two critical issues in machine learning that, despite significant research, remain difficult to solve. This paper presents the Systematic Hybrid Resampling Ensemble Approach (SHREA), which handles class imbalance and data incompleteness in a given dataset and improves classification performance. We use an oscillator-guided, factor-based multiple-imputation oversampling technique to balance the minority and majority samples while substituting missing values in the dataset. The resulting oversampled dataset then undergoes random undersampling to create majority- and minority-class subsets. These subsets are trained with one of two one-class-classifier-based methods: a One-Class Support Vector Machine or the Local Outlier Factor. Finally, bootstrap-aggregation ensembles are built from the majority- and minority-class classifiers and combined to produce a score-based prediction. To mimic real-life scenarios in which data may be missing, we introduce random missing values into each imbalanced dataset, creating three new versions per dataset with 10%, 20%, and 30% missing values. The proposed method is evaluated on datasets from the KEEL repository, and the results are compared against RBG, SBG, SBT, DTE, and EUS. The experimental analysis shows that the proposed approach outperforms the existing methods in both efficiency and significance. The Local Outlier Factor variant of SHREA improves Recall, AUC, f-measure, and g-mean by 3.46%, 5.30%, 10.51%, and 9.26%, respectively, and the One-Class Support Vector Machine variant by 4.82%, 5.95%, 11.03%, and 8.80%.
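The oversample-then-undersample step can be sketched as follows. This simplified stand-in ignores SHREA's imputation and oscillator guidance and just balances a toy dataset by oversampling the minority class with replacement and undersampling the majority class:

```python
import numpy as np

rng = np.random.default_rng(42)

def hybrid_resample(X, y, minority=1, rng=rng):
    """Oversample the minority class with replacement, then undersample the
    majority class, yielding a balanced subset (a simplified stand-in for the
    imputation-aware oversampling + random undersampling used in SHREA)."""
    X_min, X_maj = X[y == minority], X[y != minority]
    n = (len(X_min) + len(X_maj)) // 2
    over = X_min[rng.integers(0, len(X_min), n)]             # oversample minority
    under = X_maj[rng.choice(len(X_maj), n, replace=False)]  # undersample majority
    Xb = np.vstack([over, under])
    yb = np.concatenate([np.full(n, minority), np.full(n, 1 - minority)])
    return Xb, yb

# Imbalanced toy data: 90 majority (class 0) vs 10 minority (class 1) samples.
X = rng.random((100, 3))
y = np.concatenate([np.zeros(90, int), np.ones(10, int)])
Xb, yb = hybrid_resample(X, y)
# The resampled subset is exactly balanced: 50 samples per class.
```

Repeating this resampling with fresh random draws and training one classifier per draw yields the bootstrap-aggregation ensemble described above.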

Citations: 0
An Automated Histopathological Colorectal Cancer Multi-Class Classification System Based on Optimal Image Processing and Prominent Features
IF 1.8 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-18 | DOI: 10.1111/coin.70007
Tasnim Jahan Tonni, Shakil Rana, Kaniz Fatema, Asif Karim, Md. Awlad Hossen Rony, Md. Zahid Hasan, Md. Saddam Hossain Mukta, Sami Azam

Colorectal cancer (CRC) is characterized by the uncontrollable growth of cancerous cells within the rectal mucosa. Colon polyps, which are precancerous growths, can develop into colon cancer, causing symptoms such as rectal bleeding, abdominal pain, diarrhea, weight loss, and constipation. CRC is among the leading causes of cancer death worldwide, and this potentially fatal disease severely afflicts the elderly. Early diagnosis is crucial for effective treatment but is often time-consuming and laborious for experts. This study improved the accuracy of CRC multi-class classification over previous research by utilizing diverse datasets, namely NCT-CRC-HE-100K (100,000 images) and CRC-VAL-HE-7K (7,180 images). Initially, we applied various image processing techniques to the NCT-CRC-HE-100K dataset to improve image quality and remove noise, followed by multiple feature extraction and selection methods to identify prominent features from a large data hub, experimenting with different approaches to select the best classifiers for these critical features. The third ensemble model (XGB-LightGBM-RF) achieved an optimal accuracy of 99.63% with 40 prominent features chosen by univariate feature selection. It also achieved 99.73% accuracy on the CRC-VAL-HE-7K dataset, and 99.27% after the two datasets were combined. In addition, we trained and tested the model across datasets, using 80% of the NCT-CRC-HE-100K data for training and 20% of the CRC-VAL-HE-7K data for testing, where the third ensemble model obtained 98.43% multi-class accuracy. The results show that this new framework, built on the third ensemble model, can help experts determine which type of CRC disease a patient has at the very beginning of an investigation.
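Univariate feature selection of the kind used here ranks each feature independently and keeps the top k. A minimal sketch with an ANOVA-style score on synthetic data (not the study's pipeline; feature 3 is deliberately made informative):

```python
import numpy as np

def top_k_univariate(X, y, k):
    """Rank features by a one-way ANOVA-style score (between-class variance
    over within-class variance) and return the indices of the top k."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    between = sum((y == c).sum() * (X[y == c].mean(axis=0) - overall) ** 2
                  for c in classes)
    within = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0)
                 for c in classes)
    scores = between / (within + 1e-12)
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(7)
# Synthetic data: 200 samples, 10 features; only feature 3 depends on the label.
y = rng.integers(0, 2, 200)
X = rng.normal(size=(200, 10))
X[:, 3] += 3.0 * y
selected = top_k_univariate(X, y, k=2)
# The informative feature should be ranked first.
```

Because each feature is scored in isolation, this step is cheap even on very wide feature sets, which is why it is a common filter before ensemble classifiers like XGB, LightGBM, and random forests.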

Citations: 0
Mining User Study Data to Judge the Merit of a Model for Supporting User-Specific Explanations of AI Systems
IF 1.8 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-17 | DOI: 10.1111/coin.70015
Owen Chambers, Robin Cohen, Maura R. Grossman, Liam Hebert, Elias Awad

In this paper, we present a model for supporting user-specific explanations of AI systems. We then discuss a user study conducted to gauge whether the decisions to adjust output for users with certain characteristics were confirmed to be of value to participants. We focus on the merit of tuning explanations to particular psychological profiles of users, and on the value of offering different levels of explanation (including, as one option, no explanation at all). Following the description of the study, we present an approach for mining the participants' responses to determine whether the model developed for varying output to users was well-founded. While our results in this respect are preliminary, we explain how using varied machine learning methods serves as a concrete step toward validating specific approaches to AI explanation. We conclude with a discussion of related work and some ideas for future directions of the research.

引用次数: 0
Resan: A Residual Dual-Attention Network for Abnormal Cardiac Activity Detection
IF 1.8 CAS Tier 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-12-17 DOI: 10.1111/coin.70005
Xuhui Wang, Yuanyuan Zhu, Fei Wu, Long Gao, Datun Qi, Xiaoyuan Jing, Chong Luo

Cardiovascular disease is one of the leading causes of death worldwide. Early and accurate detection of abnormal cardiac activity is an effective way to prevent serious cardiovascular events. Electrocardiogram (ECG) and phonocardiogram (PCG) signals provide an objective evaluation of the heart's electrical and acoustic functions, enabling medical professionals to make an accurate diagnosis; cardiologists therefore often use them for a preliminary diagnosis of abnormal cardiac activity in clinical practice. Many diagnostic models have been proposed for this purpose. However, these models fail to exploit the interaction information within and between the signals to aid the diagnosis of disease. To address this issue, we designed a residual dual-attention network (ResAN) for the detection of abnormal cardiac activity using synchronized ECG and PCG signals. First, ResAN uses a feature learning module with two parallel residual networks (ECG-ResNet and PCG-ResNet) to automatically learn deep modal-specific features from the ECG and PCG sequences, respectively. Second, to fully utilize the information available in the different modal signals, ResAN uses a dual-attention fusion module to capture the salient features of the integrated ECG and PCG features learned by the feature learning module, as well as the alternating features between them, based on attention mechanisms. Finally, these fused features are merged and fed to the classification module to detect abnormal cardiac activity. Our model achieves an accuracy of 96.1%, surpassing comparison models by 1.0% to 9.9% when using synchronized ECG and PCG signals. Furthermore, an ablation study confirmed the efficacy of the components of ResAN and showed that ResAN performs better with synchronized ECG and PCG signals than with single-modal signals. Overall, ResAN provides a valid solution for the early detection of abnormal cardiac activity using ECG and PCG signals.
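The dual-attention fusion described in the abstract can be illustrated with a minimal numpy sketch (illustrative shapes and gating only, not ResAN's actual layers): channel attention reweights feature channels by a gate computed from a global pool, spatial attention reweights time steps by a gate pooled across channels, and both are applied to the concatenated ECG and PCG features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """Reweight channels of a (C, T) map by a gate from their global average."""
    gate = sigmoid(feat.mean(axis=1))     # one gate per channel, in (0, 1)
    return feat * gate[:, None]

def spatial_attention(feat):
    """Reweight time steps of a (C, T) map by a gate pooled across channels."""
    gate = sigmoid(feat.mean(axis=0))     # one gate per time step, in (0, 1)
    return feat * gate[None, :]

def dual_attention_fuse(ecg_feat, pcg_feat):
    """Concatenate modal features, then apply channel and spatial attention."""
    fused = np.concatenate([ecg_feat, pcg_feat], axis=0)  # (C_ecg + C_pcg, T)
    return spatial_attention(channel_attention(fused))

rng = np.random.default_rng(0)
ecg = rng.normal(size=(4, 16))   # 4 ECG feature channels, 16 time steps
pcg = rng.normal(size=(4, 16))   # 4 PCG feature channels, 16 time steps
out = dual_attention_fuse(ecg, pcg)
```

Because both gates lie in (0, 1), the fused output never amplifies a feature; it only emphasizes some channels and time steps relative to others.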

Citations: 0
A ViT-Based Adaptive Recurrent Mobilenet With Attention Network for Video Compression and Bit-Rate Reduction Using Improved Heuristic Approach Under Versatile Video Coding
IF 1.8 CAS Tier 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-12-10 DOI: 10.1111/coin.70014
D. Padmapriya, Ameelia Roseline A

Video compression has received attention from the video processing and deep learning communities. Modern learning-aided mechanisms use a hybrid coding approach to reduce redundancy in pixel space across time and space, improving motion compensation accuracy. Video compression has seen important improvements in recent years. Versatile Video Coding (VVC), also referred to as H.266, is the latest major video compression standard. The VVC codec is a block-based hybrid codec, making it highly capable and complex. Video coding effectively compresses data while reducing compression artifacts, enhancing the quality and functionality of AI video technologies. However, traditional models suffer from incorrect motion compression and ineffective motion compensation frameworks, leading to compression faults with a minimal rate-distortion trade-off. This work implements an automated and effective video compression task under VVC using a deep learning approach. Motion estimation is conducted using a Motion Vector (MV) encoder-decoder model to track movements in the video. Based on these MVs, frames are reconstructed to compensate for the motion. The residual images are obtained using a Vision Transformer-based Adaptive Recurrent MobileNet with Attention Network (ViT-ARMAN). The parameters of the ViT-ARMAN are optimized using the Opposition-based Golden Tortoise Beetle Optimizer (OGTBO). Entropy coding is used in the training phase to find the bit rate of the residual images. Extensive experiments demonstrate the effectiveness of the developed deep learning-based method for video compression and bit rate reduction under VVC.
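The hybrid-coding idea behind this pipeline, predicting a frame by motion compensation, coding only the residual, and estimating its cost with entropy, can be sketched on toy data. This is a generic illustration of the principle, not the VVC or ViT-ARMAN implementation:

```python
import numpy as np

def motion_compensate(ref, mv):
    """Predict the current frame by shifting the reference by motion vector (dy, dx)."""
    return np.roll(ref, shift=mv, axis=(0, 1))

def entropy_bits_per_symbol(values):
    """Shannon entropy of the value histogram: a lower bound on bits per sample."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(1)
ref = rng.integers(0, 256, size=(32, 32))
cur = np.roll(ref, shift=(2, 3), axis=(0, 1))  # current frame = reference moved by (2, 3)

pred = motion_compensate(ref, (2, 3))          # motion estimate happens to be exact
residual = cur - pred                          # all zeros: almost free to code
raw_bits = entropy_bits_per_symbol(cur)        # cost of coding the frame directly
```

When the motion estimate is exact, the residual collapses to zero and its entropy vanishes, which is why good motion estimation directly lowers the bit rate.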

Citations: 0
Homomorphisms and Embeddings of STRIPS Planning Models
IF 1.8 CAS Tier 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-12-10 DOI: 10.1111/coin.70013
Arnaud Lequen, Martin C. Cooper, Frédéric Maris
Determining whether two STRIPS planning instances are isomorphic is the simplest form of comparison between planning instances. It is also a particular case of the problem of finding an isomorphism between a planning instance P and a sub-instance of another instance P′. One application of such a mapping is to efficiently produce a compiled form containing all solutions to P from a compiled form containing all solutions to P′. We also introduce the notion of embedding from an instance P to another instance P′, which allows us to deduce that P′ has no solution plan if P is unsolvable. In this paper, we study the complexity of these problems. We show that the first problem is GI-complete and can thus, in theory, be solved in quasi-polynomial time. While we prove the remaining problems to be NP-complete, we propose an algorithm that builds an isomorphism when possible. We report extensive experimental trials on benchmark problems, which demonstrate that applying constraint propagation in preprocessing can greatly improve the efficiency of a SAT solver.
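A brute-force check of the first problem can be sketched in a few lines of Python, using a toy encoding of STRIPS instances as fluent sets (this is an illustration, not the authors' algorithm): search over fluent bijections for one that maps the init, goal, and action sets of one instance onto the other.

```python
from itertools import permutations

def rename(instance, mapping):
    """Apply a fluent renaming to a STRIPS instance (init, goal, actions)."""
    ren = lambda s: frozenset(mapping[f] for f in s)
    init, goal, actions = instance
    return (ren(init), ren(goal),
            frozenset((ren(pre), ren(add), ren(dele)) for pre, add, dele in actions))

def isomorphic(p, q, fluents_p, fluents_q):
    """Brute-force search for a fluent bijection mapping p onto q.
    The problem is GI-complete, so this exhaustive search is exponential."""
    if len(fluents_p) != len(fluents_q):
        return False
    fp = sorted(fluents_p)
    for perm in permutations(sorted(fluents_q)):
        mapping = dict(zip(fp, perm))
        if rename(p, mapping) == q:
            return True
    return False

# Two toy instances that differ only by the renaming a -> x, b -> y.
# Each action is a triple (preconditions, add effects, delete effects).
p = (frozenset({"a"}), frozenset({"b"}),
     frozenset({(frozenset({"a"}), frozenset({"b"}), frozenset({"a"}))}))
q = (frozenset({"x"}), frozenset({"y"}),
     frozenset({(frozenset({"x"}), frozenset({"y"}), frozenset({"x"}))}))

same = isomorphic(p, q, {"a", "b"}, {"x", "y"})
```

The quasi-polynomial and SAT-based methods the abstract mentions exist precisely because this naive enumeration over all n! fluent bijections does not scale.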
Citations: 0
Beyond Words: ESC-Net Revolutionizes VQA by Elevating Visual Features and Defying Language Priors
IF 1.8 CAS Tier 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-12-03 DOI: 10.1111/coin.70010
Souvik Chowdhury, Badal Soni

Language prior is a pressing problem in the VQA domain, where a model favors the most frequent answer associated with a question rather than reasoning over the image. Several approaches have been adopted to mitigate the language prior issue, for example, ensemble approaches, balanced-data approaches, modified evaluation strategies, and modified training frameworks. In this article, we propose a VQA model, the "Ensemble of Spatial and Channel Attention Network (ESC-Net)," to overcome the language bias problem by improving the visual features. In this work, we use regional and global image features along with an ensemble of combined channel and spatial attention mechanisms to improve the visual features. The model is a simpler and more effective solution to language bias than existing methods. Extensive experiments show a remarkable performance improvement of 18% on the VQA-CP v2 dataset compared with current state-of-the-art (SOTA) models.
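A toy, purely hypothetical example makes the language prior concrete: a "model" that ignores the image and always returns the most frequent training answer for each question already scores well on a skewed answer distribution, which is exactly the shortcut the VQA-CP v2 split is designed to penalize.

```python
from collections import Counter, defaultdict

# Hypothetical (question, answer) pairs with a skewed answer distribution,
# mimicking the bias that changed-priors VQA splits were built to expose.
train = [
    ("what color is the banana", "yellow"), ("what color is the banana", "yellow"),
    ("what color is the banana", "yellow"), ("what color is the banana", "green"),
    ("is the man smiling", "yes"), ("is the man smiling", "yes"),
    ("is the man smiling", "no"),
]

def language_prior(pairs):
    """Most frequent answer per question, ignoring the image entirely."""
    by_q = defaultdict(Counter)
    for q, a in pairs:
        by_q[q][a] += 1
    return {q: c.most_common(1)[0][0] for q, c in by_q.items()}

prior = language_prior(train)
train_acc = sum(prior[q] == a for q, a in train) / len(train)
```

On this split the blind baseline already answers 5 of 7 questions correctly, which is why models must be evaluated on splits where the answer distribution shifts between training and test.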

{"title":"Beyond Words: ESC-Net Revolutionizes VQA by Elevating Visual Features and Defying Language Priors","authors":"Souvik Chowdhury,&nbsp;Badal Soni","doi":"10.1111/coin.70010","DOIUrl":"https://doi.org/10.1111/coin.70010","url":null,"abstract":"<div>\u0000 \u0000 <p>Language prior is a pressing problem in the VQA domain where a model provides an answer favoring the most frequent related answer. There are some methods that are adopted to mitigate language prior issue, for example, ensemble approach, the balanced data approach, the modified evaluation strategy, and the modified training framework. In this article, we propose a VQA model, “Ensemble of Spatial and Channel Attention Network (ESC-Net),” to overcome the language bias problem by improving the visual features. In this work, we have used regional and global image features along with an ensemble of combined channel and spatial attention mechanisms to improve visual features. The model is a simpler and effective solution than existing methods to solve language bias. Extensive experiment show a remarkable performance improvement of 18% on the VQACP v2 dataset with a comparison to current state-of-the-art (SOTA) models.</p>\u0000 </div>","PeriodicalId":55228,"journal":{"name":"Computational Intelligence","volume":"40 6","pages":""},"PeriodicalIF":1.8,"publicationDate":"2024-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142762484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0