Pub Date: 2025-02-20 DOI: 10.1016/j.bdr.2025.100510
Rafael Teixeira, Leonardo Almeida, Mário Antunes, Diogo Gomes, Rui L. Aguiar
With the rapid development of 6G, Artificial Intelligence (AI) is expected to play a pivotal role in network management, resource optimization, and intrusion detection. However, deploying AI models in 6G networks faces several challenges, such as the lack of dedicated hardware for AI tasks and the need to protect user privacy. To address these challenges, Federated Learning (FL) emerges as a promising solution for distributed AI training without the need to move data from users' devices. This paper investigates the performance and costs of different FL approaches regarding training time, communication overhead, and energy consumption. The results show that FL can significantly accelerate the training process while reducing the data transferred across the network. However, the effectiveness of FL depends on the specific FL approach and the network conditions.
{"title":"Efficient training: Federated learning cost analysis","authors":"Rafael Teixeira , Leonardo Almeida , Mário Antunes , Diogo Gomes , Rui L. Aguiar","doi":"10.1016/j.bdr.2025.100510","DOIUrl":"10.1016/j.bdr.2025.100510","url":null,"abstract":"<div><div>With the rapid development of 6G, Artificial Intelligence (AI) is expected to play a pivotal role in network management, resource optimization, and intrusion detection. However, deploying AI models in 6G networks faces several challenges, such as the lack of dedicated hardware for AI tasks and the need to protect user privacy. To address these challenges, Federated Learning (FL) emerges as a promising solution for distributed AI training without the need to move data from users' devices. This paper investigates the performance and costs of different FL approaches regarding training time, communication overhead, and energy consumption. The results show that FL can significantly accelerate the training process while reducing the data transferred across the network. However, the effectiveness of FL depends on the specific FL approach and the network conditions.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100510"},"PeriodicalIF":3.5,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143454033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This research aims to improve the accuracy and efficiency of Optical Character Recognition (OCR) technology for the Thai language, specifically in the context of Thai government documents. OCR enables the conversion of text from images into a machine-readable format, facilitating document storage and further processing. However, applying OCR to the Thai language presents unique challenges due to its complexity. This study focuses on enhancing the performance of the Tesseract OCR engine, a widely used free OCR technology, by implementing various image preprocessing techniques such as masking, adaptive thresholding, median filtering, Canny edge detection, and morphological operators. A dataset of Thai documents is utilized, and the OCR system's output is evaluated using word error rate (WER) and character error rate (CER) metrics. To improve text extraction accuracy, the research employs the original U-Net architecture [19] for image segmentation. Furthermore, the Tesseract OCR engine is fine-tuned, and image preprocessing is performed to optimize OCR accuracy. The developed tools automate workflow processes, alleviate constraints on model training, and enable the effective utilization of information from official Thai documents for various purposes.
{"title":"Improved Tesseract optical character recognition performance on Thai document datasets","authors":"Noppol Anakpluek, Watcharakorn Pasanta, Latthawan Chantharasukha, Pattanawong Chokratansombat, Pajaya Kanjanakaew, Thitirat Siriborvornratanakul","doi":"10.1016/j.bdr.2025.100508","DOIUrl":"10.1016/j.bdr.2025.100508","url":null,"abstract":"<div><div>This research aims to improve the accuracy and efficiency of Optical Character Recognition (OCR) technology for the Thai language, specifically in the context of Thai government documents. OCR enables the conversion of text from images into machine-readable format, facilitating document storage and further processing. However, applying OCR to the Thai language presents unique challenges due to its complexity. This study focuses on enhancing the performance of the Tesseract OCR engine, a widely used free OCR technology, by implementing various image preprocessing techniques such as masking, adaptive thresholds, median filtering, Canny edge detection, and morphological operators. A dataset of Thai documents is utilized, and the OCR system's output is evaluated using word error rate (WER) and character error rate (CER) metrics. To improve text extraction accuracy, the research employs the original U-Net architecture [<span><span>19</span></span>] for image segmentation. Furthermore, the Tesseract OCR engine is finetuned, and image preprocessing is performed to optimize OCR system accuracy. The developed tools automate workflow processes, alleviate constraints on model training, and enable the effective utilization of information from official Thai documents for various purposes.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"39 ","pages":"Article 100508"},"PeriodicalIF":3.5,"publicationDate":"2025-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143396246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-07 DOI: 10.1016/j.bdr.2025.100509
Rubén Alonso, Danilo Dessí, Antonello Meloni, Diego Reforgiato Recupero
Vast amounts of information about job supply and demand are posted on the web every day, and this has heavily affected the job market. The online application process has thus become efficient for applicants, as it allows them to submit their resumes over the Internet to numerous organizations simultaneously. Online systems such as Monster.com, OfferZen, and LinkedIn contain millions of job offers and resumes of potential candidates, leaving companies with the hard task of managing an enormous amount of data to select the most suitable applicant. The task of assessing candidates' resumes and automatically recommending which one best suits a particular position has therefore become essential to speed up the hiring process. Similarly, it is important to help applicants quickly find a job appropriate to their skills and to recommend what they need to master to become eligible for certain jobs. Our approach lies in this context and proposes a new method to identify skills in candidates' resumes and match resumes with job descriptions. We employed the O*NET database entities related to the skills and abilities required by different jobs; moreover, we leveraged deep learning technologies to compute the semantic similarity between O*NET entities and parts of the text extracted from candidates' resumes. The ultimate goal is to identify the most suitable job for a given resume according to the information it contains. We have defined two scenarios: i) given a resume, identify the top O*NET occupations with the highest match with the resume; ii) given a candidate's resume and a set of job descriptions, identify which of the input jobs is the most suitable for the candidate. The evaluation carried out indicates that the proposed approach outperforms the baselines in both scenarios. Finally, we provide a use case for candidates in which courses are recommended to fill certain skill gaps and qualify them for a given job.
{"title":"A novel approach for job matching and skill recommendation using transformers and the O*NET database","authors":"Rubén Alonso , Danilo Dessí , Antonello Meloni , Diego Reforgiato Recupero","doi":"10.1016/j.bdr.2025.100509","DOIUrl":"10.1016/j.bdr.2025.100509","url":null,"abstract":"<div><div>Today we have tons of information posted on the web every day regarding job supply and demand which has heavily affected the job market. The online enrolling process has thus become efficient for applicants as it allows them to present their resumes using the Internet and, as such, simultaneously to numerous organizations. Online systems such as Monster.com, OfferZen, and LinkedIn contain millions of job offers and resumes of potential candidates leaving to companies with the hard task to face an enormous amount of data to manage to select the most suitable applicant. The task of assessing the resumes of candidates and providing automatic recommendations on which one suits a particular position best has, therefore, become essential to speed up the hiring process. Similarly, it is important to help applicants to quickly find a job appropriate to their skills and provide recommendations about what they need to master to become eligible for certain jobs. Our approach lies in this context and proposes a new method to identify skills from candidates' resumes and match resumes with job descriptions. We employed the O*NET database entities related to different skills and abilities required by different jobs; moreover, we leveraged deep learning technologies to compute the semantic similarity between O*NET entities and part of text extracted from candidates' resumes. The ultimate goal is to identify the most suitable job for a certain resume according to the information there contained. We have defined two scenarios: i) given a resume, identify the top O*NET occupations with the highest match with the resume, ii) given a candidate's resume and a set of job descriptions, identify which one of the input jobs is the most suitable for the candidate. The evaluation that has been carried out indicates that the proposed approach outperforms the baselines in the two scenarios. Finally, we provide a use case for candidates where it is possible to recommend courses with the goal to fill certain skills and make them qualified for a certain job.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"39 ","pages":"Article 100509"},"PeriodicalIF":3.5,"publicationDate":"2025-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143377085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-22 DOI: 10.1016/j.bdr.2025.100507
Iqra Muneer, Aysha Shehzadi, Muhammad Adnan Ashraf, Rao Muhammad Adeel Nawab
In recent years, automatic text rewriting (or paraphrasing) tools have become readily and publicly available. These tools have made text paraphrasing exceptionally straightforward, encouraging effortless plagiarism and text reuse. In the literature, most efforts have focused on detecting real cases of paraphrasing (manual/human paraphrasing), mainly in the domain of journalism. However, paraphrase detection has not been thoroughly explored for artificial cases (machine paraphrasing), mainly due to the lack of standard resources for evaluation. To fill this gap, this study proposes three benchmark corpora for artificial paraphrase cases at the sentence level, along with one real corpus containing examples from daily-life activities. Three popular and widely used online text rewriting tools, i.e., paraphrasing-tools, articlerewritetool, and rewritertools, were used to develop the artificial case corpora. Further, we used two real-case corpora: the Microsoft Paraphrase Corpus (MSRP), from the domain of journalism, and a proposed real corpus that combines carefully extracted Quora question pairs with MSRP (Q-MSRP). Both real-case and artificial-case paraphrases were evaluated using classical machine learning, transfer learning, large language models, and a proposed model to investigate which of the two types of paraphrasing is more difficult to detect. The results show that our proposed model outperforms all the other approaches for both artificial and real case paraphrase detection. A thorough analysis of the results suggests that manual paraphrasing is still by far the harder to detect, although certain machine-paraphrased texts are equally difficult. All proposed corpora are freely available to promote research on artificial case paraphrase detection.
{"title":"Has machine paraphrasing skills approached humans? Detecting automatically and manually generated paraphrased cases","authors":"Iqra Muneer , Aysha Shehzadi , Muhammad Adnan Ashraf , Rao Muhammad Adeel Nawab","doi":"10.1016/j.bdr.2025.100507","DOIUrl":"10.1016/j.bdr.2025.100507","url":null,"abstract":"<div><div>In recent years, automatic text rewriting (or paraphrasing) tools are readily and publicly available. These tools have enabled text paraphrasing as an exceptionally straightforward approach that encourages trouble-free plagiarism and text reuse. In literature, the majority of efforts have focused on detecting real cases (manual/human paraphrasing) of paraphrasing (mainly in the domain of journalism). However, the problem of paraphrase detection has not been thoroughly explored for artificial cases (machine paraphrased), mainly, due to lack of standard resources for its evaluation. To fulfill this gap, this study proposes three benchmark corpora for artificial cases of paraphrases at sentence level, and one real corpus contains examples from daily life activities. Three popular and widely used automatic text rewriting online tools have been used, i.e., paraphrasing-tools, articlerewritetool and rewritertools, to develop artificial case corpora. Further, we used two real cases corpora, including Microsoft Paraphrase Corpus (MSRP) (from the domain of journalism) and a proposed real corpus which is a combination of carefully extracted Quora question pairs and MSRP (Q-MSRP). Both real case and artificial case paraphrases were evaluated using classical machine learning, transfer learning, Large language models and a proposed model, to investigate which of the two types of paraphrasing is more difficult to detect. The results show that our proposed model outperforms all the other approaches for both artificial and real case paraphrase detection. A thorough analysis of the results suggests that, by far, manual paraphrasing is still harder to detect but certain machine paraphrased texts are equally difficult to detect. All proposed corpora are freely available to promote the research on artificial case paraphrase detection.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"39 ","pages":"Article 100507"},"PeriodicalIF":3.5,"publicationDate":"2025-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143092345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-17 DOI: 10.1016/j.bdr.2025.100506
Mingwei Tang, Kun Yang, Linping Tao, Mingfeng Zhao, Wei Zhou
Aspect Sentiment Triplet Extraction (ASTE) is an emerging sentiment analysis task that extracts aspect terms and their sentiment polarity together with the opinion terms that express that polarity. Several models have been presented to analyze sentence sentiment more accurately. Nonetheless, previous models have suffered from problems such as inconsistent sentiment predictions in one-to-many and many-to-one cases and in sequence annotation. In addition, part-of-speech and contextual semantic information are ignored, so complete multi-word aspect terms and opinion terms cannot be identified. To address these problems, we propose a Multi-granularity Enhanced Graph Convolutional Network (MGEGCN) to solve the problem of inaccurate multi-word term recognition. First, we propose a dual-channel enhanced graph convolutional network, which simultaneously analyzes syntactic structure and part-of-speech information and uses their combined effect to enhance the deep semantic information of aspect terms and opinion terms. Second, we design a multi-scale attention mechanism, which combines self-attention with depthwise separable convolution to strengthen attention to aspect terms and opinion terms. In addition, a convolutional decoding strategy is used in the decoding stage to extract triplets by directly detecting and classifying the relational regions in the table. In the experimental part, we conduct experiments on two public datasets (ASTE-DATA-v1 and ASTE-DATA-v2) to show that the model improves performance on the ASTE task. On the four subsets (14res, 14lap, 15res, and 16res), the F1 scores of MGEGCN are 75.65%, 61.62%, 67.62%, and 74.12% on ASTE-DATA-v1 and 74.69%, 62.10%, 68.18%, and 74.00% on ASTE-DATA-v2, respectively.
{"title":"Multi-granularity enhanced graph convolutional network for aspect sentiment triplet extraction","authors":"Mingwei Tang , Kun Yang , Linping Tao , Mingfeng Zhao , Wei Zhou","doi":"10.1016/j.bdr.2025.100506","DOIUrl":"10.1016/j.bdr.2025.100506","url":null,"abstract":"<div><div>Aspect Sentiment Triple Extraction (ASTE) is an emerging sentiment analysis task, which describes both aspect terms and their sentiment polarity, as well as opinion terms that represent sentiment polarity. Some models have been presented to analyze sentence sentiment more accurately. Nonetheless, previous models have had problems, like inconsistent sentiment predictions for one-to-many, many-to-one, and sequence annotation. In addition, part-of-speech and contextual semantic information are ignored, resulting in the inability to identify complete multi-word aspect terms and opinion terms. To address these problems, we propose a <em>Multi-granularity Enhanced Graph Convolutional Network</em> (MGEGCN) to solve the problem of inaccurate multi-word term recognition. First, we propose a dual-channel enhanced graph convolutional network, which simultaneously analyzes syntactic structure and part-of-speech information and uses the combined effect of the two to enhance the deep semantic information of aspect terms and opinion terms. Second, we also design a multi-scale attention, which combines self-attention with deep separable convolution to enhance attention to aspect terms and opinion terms. In addition, a convolutional decoding strategy is used in the decoding stage to extract triples by directly detecting and classifying the relational regions in the table. In the experimental part, we conduct analysis on two public datasets (ASTE-DATA-v1 and ASTE-DATA-v2) to prove that the model improves the performance of ASTE tasks. In four subsets (14res, 14lap, 15res, and 16res), the F1 scores of the MGEGCN method are 75.65%, 61.62%, 67.62%, 74.12% and 74.69%, 62.10%, 68.18%, 74.00%, respectively.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"39 ","pages":"Article 100506"},"PeriodicalIF":3.5,"publicationDate":"2025-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143092348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-16 DOI: 10.1016/j.bdr.2024.100505
S. Anjali Devi, M. Sitha Ram, Pulugu Dileep, Sasibhushana Rao Pappu, T. Subha Mastan Rao, Mula Malyadri
With the rapid growth of Internet technology and social networks, the amount of text-based information generated on the web has increased. Analyzing the sentiment behind input text is highly important for easing Natural Language Processing (NLP) tasks. To effectively analyze sentiment polarities (positive, negative, and neutral), categorizing the aspects in the text is an essential task. Several existing studies have attempted to accurately classify aspects based on sentiments in text inputs. However, existing methods attain limited performance because of reduced aspect coverage, inefficiency in handling ambiguous language, inappropriate feature extraction, lack of contextual understanding, and overfitting. Thus, this study develops an effective word-embedding scheme with a novel hybrid deep learning technique for aspect-based sentiment analysis of social media text. Initially, the collected raw input text is pre-processed to remove undesirable data through tokenization, stemming, lemmatization, duplicate removal, stop-word removal, and removal of empty sets and empty rows. The required information is extracted from the pre-processed text using three word-level embedding methods: scored-lexicon-based Word2Vec, GloVe, and Extended Bidirectional Encoder Representations from Transformers (E-BERT). After sufficient features are extracted, the aspects are analyzed and the exact sentiment polarities are classified with a novel Positional-Attention-based Bidirectional Deep Stacked AutoEncoder (PA_BiDSAE) model, in which a BiLSTM network is hybridized with a deep stacked autoencoder (DSAE) to categorize sentiment. The experiments are implemented in Python, and the proposed model is evaluated on three publicly available datasets: SemEval 2014 (Restaurant), SemEval 2014 (Laptop), and SemEval 2015 (Restaurant). The performance analysis shows that the proposed hybrid deep learning model achieves improved classification performance in terms of accuracy, precision, recall, specificity, F1 score, and kappa.
{"title":"Positional-attention based bidirectional deep stacked AutoEncoder for aspect based sentimental analysis","authors":"S. Anjali Devi , M. Sitha Ram , Pulugu Dileep , Sasibhushana Rao Pappu , T. Subha Mastan Rao , Mula Malyadri","doi":"10.1016/j.bdr.2024.100505","DOIUrl":"10.1016/j.bdr.2024.100505","url":null,"abstract":"<div><div>With the rapid growth of Internet technology and social networks, the generation of text-based information on the web is increased. To ease the Natural Language Processing (NLP) tasks, analyzing the sentiments behind the provided input text is highly important. To effectively analyze the polarities of sentiments (positive, negative and neutral), categorizing the aspects in the text is an essential task. Several existing studies have attempted to accurately classify aspects based on sentiments in text inputs. However, the existing methods attained limited performance because of reduced aspect coverage, inefficiency in handling ambiguous language, inappropriate feature extraction, lack of contextual understanding and overfitting issues. Thus, the proposed study intends to develop an effective word embedding scheme with a novel hybrid deep learning technique for performing aspect-based sentimental analysis in a social media text. Initially, the collected raw input text data are pre-processed to reduce the undesirable data by initiating tokenization, stemming, lemmatization, duplicate removal, stop words removal, empty sets removal and empty rows removal. The required information from the pre-processed text is extracted using three varied word-level embedding methods: Scored-Lexicon based Word2Vec, Glove modelling and Extended Bidirectional Encoder Representation from Transformers (E-BERT). After extracting sufficient features, the aspects are analyzed, and the exact sentimental polarities are classified through a novel Positional-Attention-based Bidirectional Deep Stacked AutoEncoder (PA_BiDSAE) model. In this proposed classification, the BiLSTM network is hybridized with a deep stacked autoencoder (DSAE) model to categorize sentiment. The experimental analysis is done by using Python software, and the proposed model is simulated with three publicly available datasets: SemEval Challenge 2014 (Restaurant), SemEval Challenge 2014 (Laptop) and SemEval Challenge 2015 (Restaurant). The performance analysis proves that the proposed hybrid deep learning model obtains improved classification performance in accuracy, precision, recall, specificity, F1 score and kappa measure.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"39 ","pages":"Article 100505"},"PeriodicalIF":3.5,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143092347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper is devoted to dimension-reduction techniques for multivariate spatially indexed functional data defined on different domains. We present a method called Spatial Multivariate Functional Principal Component Analysis (SMFPCA), which performs principal component analysis for multivariate spatial functional data. In contrast to the multivariate Karhunen-Loève approach for independent data, SMFPCA is notably adept at capturing spatial dependencies among multiple functions. SMFPCA applies spectral functional component analysis to multivariate functional spatial data, focusing on data points arranged on a regular grid. The methodological framework and algorithm of SMFPCA have been developed to tackle the challenges arising from the lack of appropriate methods for managing this type of data. The performance of the proposed method has been verified through finite-sample properties using simulated datasets and a sea-surface temperature dataset. Additionally, we conducted comparative studies of SMFPCA against some existing methods, providing valuable insights into the properties of multivariate spatial functional data within a finite sample.
{"title":"Principal component analysis of multivariate spatial functional data","authors":"Idris Si-ahmed , Leila Hamdad , Christelle Judith Agonkoui , Yoba Kande , Sophie Dabo-Niang","doi":"10.1016/j.bdr.2024.100504","DOIUrl":"10.1016/j.bdr.2024.100504","url":null,"abstract":"<div><div>This paper is devoted to the study of dimension reduction techniques for multivariate spatially indexed functional data and defined on different domains. We present a method called Spatial Multivariate Functional Principal Component Analysis (SMFPCA), which performs principal component analysis for multivariate spatial functional data. In contrast to Multivariate Karhunen-Loève approach for independent data, SMFPCA is notably adept at effectively capturing spatial dependencies among multiple functions. SMFPCA applies spectral functional component analysis to multivariate functional spatial data, focusing on data points arranged on a regular grid. The methodological framework and algorithm of SMFPCA have been developed to tackle the challenges arising from the lack of appropriate methods for managing this type of data. The performance of the proposed method has been verified through finite sample properties using simulated datasets and sea-surface temperature dataset. Additionally, we conducted comparative studies of SMFPCA against some existing methods providing valuable insights into the properties of multivariate spatial functional data within a finite sample.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"39 ","pages":"Article 100504"},"PeriodicalIF":3.5,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143092346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-14 DOI: 10.1016/j.bdr.2024.100496
Yuanting Yan, Meili Yang, Zhong Zheng, Hao Ge, Yiwen Zhang, Yanping Zhang
Classifying incomplete data using ensemble techniques is a prevalent way to address missing values, where multiple classifiers are trained on diverse subsets of features. However, current ensemble-based methods overlook the redundancy within feature subsets, which makes it hard to train robust prediction models, because redundant features can hinder the learning of the underlying rules in the data. In this paper, we propose a Reduct-Missing Pattern Fusion (RMPF) method to address this limitation. It leverages both the advantages of rough set theory and the effectiveness of missing patterns in classifying incomplete data. RMPF employs a heuristic algorithm to generate a set of positive-approximation-based attribute reducts. Subsequently, it integrates the missing patterns with these reducts through a fusion strategy to minimize data redundancy. Finally, the optimized subsets are used to train a group of base classifiers, and a selective prediction procedure is applied to produce the ensemble prediction results. Experimental results show that our method is superior to the compared state-of-the-art methods in both performance and robustness, and its advantage is especially significant on data with high missing rates.
{"title":"Incomplete data classification via positive approximation based rough subspaces ensemble","authors":"Yuanting Yan , Meili Yang , Zhong Zheng , Hao Ge , Yiwen Zhang , Yanping Zhang","doi":"10.1016/j.bdr.2024.100496","DOIUrl":"10.1016/j.bdr.2024.100496","url":null,"abstract":"<div><div>Classifying incomplete data using ensemble techniques is a prevalent method for addressing missing values, where multiple classifiers are trained on diverse subsets of features. However, current ensemble-based methods overlook the redundancy within feature subsets, presenting challenges for training robust prediction models, because the redundant features can hinder the learning of the underlying rules in the data. In this paper, we propose a Reduct-Missing Pattern Fusion (RMPF) method to address the aforementioned limitation. It leverages both the advantages of rough set theory and the effectiveness of missing patterns in classifying incomplete data. RMPF employs a heuristic algorithm to generate a set of positive approximation-based attribute reducts. Subsequently, it integrates the missing patterns with these reducts through a fusion strategy to minimize data redundancy. Finally, the optimized subsets are utilized to train a group of base classifiers, and a selective prediction procedure is applied to produce the ensembled prediction results. Experimental results show that our method is superior to the compared state-of-the-art methods in both performance and robustness. Especially, our method obtains significant superiority in the scenarios of data with high missing rates.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"38 ","pages":"Article 100496"},"PeriodicalIF":3.5,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142650891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-13 DOI: 10.1016/j.bdr.2024.100495
Jin Liu, Jianye Chen, Chongfeng Fan, Fengyu Zhou
The link prediction task aims to predict missing entities or relations in a knowledge graph and is essential for downstream applications. Existing well-known models deal with this task mainly by representing knowledge graph triplets in either distance space or semantic space. However, they cannot fully capture the information of head and tail entities, nor make good use of hierarchical information. Thus, in this paper, we propose a novel knowledge graph embedding model for the link prediction task, namely HIE, which simultaneously models each triplet (h, r, t) in a distance measurement space and a semantic measurement space. Moreover, HIE is introduced into a hierarchy-aware space to leverage rich hierarchical information of entities and relations for better representation learning. Specifically, we apply a distance transformation operation to the head entity in distance space to obtain the tail entity, instead of translation-based or rotation-based approaches. Experimental results on four real-world datasets show that HIE outperforms several existing state-of-the-art knowledge graph embedding methods on the link prediction task and handles complex relations accurately.
{"title":"Joint embedding in hierarchical distance and semantic representation learning for link prediction","authors":"Jin Liu, Jianye Chen, Chongfeng Fan, Fengyu Zhou","doi":"10.1016/j.bdr.2024.100495","DOIUrl":"10.1016/j.bdr.2024.100495","url":null,"abstract":"<div><div>The link prediction task aims to predict missing entities or relations in the knowledge graph and is essential for the downstream application. Existing well-known models deal with this task by mainly focusing on representing knowledge graph triplets in the distance space or semantic space. However, they can not fully capture the information of head and tail entities, nor even make good use of hierarchical level information. Thus, in this paper, we propose a novel knowledge graph embedding model for the link prediction task, namely, HIE, which models each triplet (<em>h</em>, <em>r</em>, <em>t</em>) into distance measurement space and semantic measurement space, simultaneously. Moreover, HIE is introduced into hierarchical-aware space to leverage rich hierarchical information of entities and relations for better representation learning. Specifically, we apply distance transformation operation on the head entity in distance space to obtain the tail entity instead of translation-based or rotation-based approaches. Experimental results of HIE on four real-world datasets show that HIE outperforms several existing state-of-the-art knowledge graph embedding methods on the link prediction task and deals with complex relations accurately.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"38 ","pages":"Article 100495"},"PeriodicalIF":3.5,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142650890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-07 DOI: 10.1016/j.bdr.2024.100494
Zhihui Lai, Xiaomei Fang, Heng Kong
Cross-modal hashing has received widespread attention in recent years owing to its outstanding performance in cross-modal data retrieval. Cross-modal hashing can be decomposed into two steps: feature learning and binarization. However, most existing cross-modal hashing methods do not take the supervisory information of the data into consideration during binary quantization, and thus often fail to adequately preserve semantic information. To solve these problems, this paper proposes a novel deep cross-modal hashing method called deep semantics-preserving cross-modal hashing (DSCMH), which makes full use of intra- and inter-modal semantic information to improve the model's performance. Moreover, by designing a label network for semantic alignment during the binarization process, DSCMH's performance can be further improved. To verify the performance of the proposed method, extensive experiments were conducted on four large datasets. The results show that the proposed method outperforms most existing cross-modal hashing methods. In addition, the ablation experiments show that the proposed regularization terms all have positive effects on the model's cross-modal retrieval performance. The code of this paper can be downloaded from http://www.scholat.com/laizhihui.
{"title":"Deep semantics-preserving cross-modal hashing","authors":"Zhihui Lai , Xiaomei Fang , Heng Kong","doi":"10.1016/j.bdr.2024.100494","DOIUrl":"10.1016/j.bdr.2024.100494","url":null,"abstract":"<div><div>Cross-modal hashing has been paid widespread attention in recent years due to its outstanding performance in cross-modal data retrieval. Cross-modal hashing can be decomposed into two steps, i.e., the feature learning and the binarization. However, most existing cross-modal hash methods do not take the supervisory information of the data into consideration during binary quantization, and thus often fail to adequately preserve semantic information. To solve these problems, this paper proposes a novel deep cross-modal hashing method called deep semantics-preserving cross-modal hashing (DSCMH), which makes full use of intra and inter-modal semantic information to improve the model's performance. Moreover, by designing a label network for semantic alignment during the binarization process, DSCMH's performance can be further improved. In order to verify the performance of the proposed method, extensive experiments were conducted on four big datasets. The results show that the proposed method is better than most of the existing cross-modal hashing methods. In addition, the ablation experiment shows that the proposed new regularized terms all have positive effects on the model's performances in cross-modal retrieval. The code of this paper can be downloaded from <span><span>http://www.scholat.com/laizhihui</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"38 ","pages":"Article 100494"},"PeriodicalIF":3.5,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142650889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}