Pub Date : 2024-06-13 | DOI: 10.1016/j.bdr.2024.100480
Peng Wang , Jiang Li , Yingli Wang , Youchun liu , Yu Zhang
Assessing soil fertility through traditional methods has faced challenges due to the vast amount of meteorological data and the complexity of heterogeneous data. In this study, we address these challenges by employing the K-means algorithm for cluster analysis of soil fertility data and developing a novel K-means algorithm within the Hadoop framework. Our research aims to provide a comprehensive analysis of soil fertility in the Shihezi region, particularly in the context of oasis cotton fields, leveraging big data techniques. The methodology uses soil nutrient data collected in 2022 from 29 sampling points across six round fields. Through K-means clustering with varying K values, we determine that setting K to 3 yields the best clustering results, aligning closely with the actual soil fertility distribution. Furthermore, we compare our proposed K-means algorithm under the MapReduce framework with the traditional serial K-means algorithm, demonstrating significant improvements in running speed and successful completion of large-scale data computations. Our findings reveal that soil fertility in the Shihezi region can be classified into four distinct grades, providing valuable insights for agricultural practices and land management strategies. This classification contributes to a better understanding of soil resources in oasis cotton fields and facilitates informed decision-making for farmers and policymakers alike.
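As a rough illustration of the serial baseline the paper compares against, the sketch below clusters a small nutrient matrix with several K values and scores each clustering. The 29x4 matrix, its random values, and the silhouette criterion are placeholders, not the paper's data or its validity measure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical soil-nutrient matrix: 29 sampling points x 4 indicators
# (e.g., organic matter, available N/P/K); values are random placeholders.
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(29, 4)))

# Try several K values and compare clustering quality, as the paper does
# when selecting K; the best silhouette marks the preferred K.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```

The MapReduce variant parallelizes the expensive step of this loop: the map phase assigns each point to its nearest centroid, and the reduce phase recomputes centroids from the partial sums.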
{"title":"Assessment of soil fertility in Xinjiang oasis cotton field based on big data techniques","authors":"Peng Wang , Jiang Li , Yingli Wang , Youchun liu , Yu Zhang","doi":"10.1016/j.bdr.2024.100480","DOIUrl":"10.1016/j.bdr.2024.100480","url":null,"abstract":"<div><p>Assessing soil fertility through traditional methods has faced challenges due to the vast amount of meteorological data and the complexity of heterogeneous data. In this study, we address these challenges by employing the K-means algorithm for cluster analysis on soil fertility data and developing a novel K-means algorithm within the Hadoop framework. Our research aims to provide a comprehensive analysis of soil fertility in the Shihezi region, particularly in the context of oasis cotton fields, leveraging big data techniques. The methodology involves utilizing soil nutrient data from 29 sampling points across six round fields in 2022. Through K-means clustering with varying K values, we determine that setting K to 3 yields optimal cluster effects, aligning closely with the actual soil fertility distribution. Furthermore, we compare the performance of our proposed K-means algorithm under the MapReduce framework with traditional serial K-means algorithms, demonstrating significant improvements in operational speed and successful completion of large-scale data computations. Our findings reveal that soil fertility in the Shihezi region can be classified into four distinct grades, providing valuable insights for agricultural practices and land management strategies. This classification contributes to a better understanding of soil resources in oasis cotton fields and facilitates informed decision-making processes for farmers and policymakers alike.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100480"},"PeriodicalIF":3.3,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141393452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-13 | DOI: 10.1016/j.bdr.2024.100475
Shuo Wang , Xiang Yu , Dan Zhao , Guocai Ma , Wei Ren , Shuxin Duan
AMT (audio magnetotelluric) surveying is widely used to obtain geological settings related to sandstone-type uranium deposits, such as the extent of buried sand bodies and the top boundary of the baserock. However, these settings are hard to read from survey sections without geological interpretation, which relies heavily on experience and cognition. Moreover, with the development of 3D technology, manual geological interpretation shows low efficiency and reliability. In this paper, a machine learning model built on U-net was used for the geological interpretation of AMT data in the Naren-Yihegaole area. To train the model, a training dataset was constructed from data simulated with random models, which addressed the issue of insufficient data samples. In the prediction stage, sand bodies and baserock were delineated from the inverted resistivity images. The machine learning interpretation showed high consistency with the manual one while taking far less time, indicating that this approach is more efficient and effective than the traditional way.
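A minimal sketch of the simulated-training-data idea described above: each sample pairs a synthetic resistivity section, drawn from a random model, with its class mask (background, sand body, baserock). All geometry, class codes, and value ranges here are invented for illustration.

```python
import numpy as np

def random_model(h=64, w=64, rng=None):
    """Generate one synthetic resistivity section and its label mask.

    A random baserock top boundary and a random rectangular sand body are
    drawn; classes: 0 = background, 1 = sand body, 2 = baserock.
    Purely illustrative, not the paper's forward modeling.
    """
    rng = rng or np.random.default_rng()
    rho = rng.normal(2.0, 0.1, size=(h, w))     # log-resistivity background
    mask = np.zeros((h, w), dtype=np.int64)

    top = rng.integers(h // 2, h - 8)           # baserock top boundary depth
    rho[top:, :] += 1.5                         # baserock is more resistive
    mask[top:, :] = 2

    r0, c0 = rng.integers(5, top - 10), rng.integers(5, w - 20)
    dr, dc = rng.integers(5, 10), rng.integers(10, 20)
    rho[r0:r0 + dr, c0:c0 + dc] -= 0.8          # conductive sand body
    mask[r0:r0 + dr, c0:c0 + dc] = 1
    return rho, mask

# Build a training set of simulated sections, as the paper's workflow suggests.
pairs = [random_model() for _ in range(1000)]
```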
{"title":"Intelligent geological interpretation of AMT data based on machine learning","authors":"Shuo Wang , Xiang Yu , Dan Zhao , Guocai Ma , Wei Ren , Shuxin Duan","doi":"10.1016/j.bdr.2024.100475","DOIUrl":"10.1016/j.bdr.2024.100475","url":null,"abstract":"<div><p>AMT (Audio Magnetotelluric) is widely used for obtaining geological settings related to sandstone-type Uranium deposits, such as the range of buried sand body and the top boundary of baserock. However, these geological settings are hard to interpret via survey sections without conducting geological interpretation, which highly relies on experience and cognition. On the other hand, with the development of 3D technology, artificial geological interpretation shows low efficiency and reliability. In this paper, a machine learning model constructed using U-net was used for the geological interpretation of AMT data in the Naren-Yihegaole area. To train the model, a training dataset was built based on simulated data from random models. The issue of insufficient data samples has been addressed. In the prediction stage, sand bodies and baserock were delineated from the inversion resistivity images. The comparison between two interpretations, one by machine learning method, showed high consistency with the artificial one, but with better time-saving. It indicates that this technology is more individualized and effective than the traditional way.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100475"},"PeriodicalIF":3.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141408443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-13 | DOI: 10.1016/j.bdr.2024.100474
Marco Ortu, Maurizio Romano, Andrea Carta
This paper proposes a novel approach to topic detection aimed at improving the semi-supervised clustering of customer reviews in the context of customer services. The proposed methodology, named SeMi-supervised clustering for Assessment of Reviews using Topic and Sentiment (SMARTS) for Topic-Community Representation with Semantic Networks, combines semantic and sentiment analysis of words to derive topics related to positive and negative reviews of specific services. To achieve this, a semantic network of words is constructed based on word-embedding semantic similarity to identify relationships between the words used in the reviews. The resulting network is then used to derive the topics present in users' reviews, which are grouped by positive and negative sentiment based on words related to specific services. Clusters of words, obtained from the network's communities, are used to extract topics related to particular services and to improve the interpretation of users' assessments of those services. The proposed methodology is applied to tourism review data from Booking.com, and the results demonstrate the efficacy of the approach in enhancing the interpretability of the topics obtained by semi-supervised clustering. The methodology has the potential to provide valuable insights into customer sentiment toward tourism services, which service providers and decision-makers could use to enhance the quality of their services.
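A compact sketch of the word-network step, assuming pretrained embeddings and a similarity threshold of 0.1 (both placeholders): words become nodes, sufficiently similar pairs become weighted edges, and network communities serve as candidate topics.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical word embeddings (word -> vector); in practice these would
# come from a pretrained model such as word2vec or GloVe.
rng = np.random.default_rng(0)
words = ["room", "bed", "clean", "staff", "friendly", "noisy"]
emb = {w: rng.normal(size=50) for w in words}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Connect word pairs whose embedding similarity exceeds the threshold.
G = nx.Graph()
G.add_nodes_from(words)
for i, a in enumerate(words):
    for b in words[i + 1:]:
        s = cosine(emb[a], emb[b])
        if s > 0.1:
            G.add_edge(a, b, weight=s)

# Communities of the word network act as candidate topics; each community
# would then be split by the sentiment of the reviews its words occur in.
if G.number_of_edges():
    topics = list(greedy_modularity_communities(G, weight="weight"))
    print(topics)
```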
{"title":"Semi-supervised topic representation through sentiment analysis and semantic networks","authors":"Marco Ortu, Maurizio Romano, Andrea Carta","doi":"10.1016/j.bdr.2024.100474","DOIUrl":"10.1016/j.bdr.2024.100474","url":null,"abstract":"<div><p>This paper proposes a novel approach to topic detection aimed at improving the semi-supervised clustering of customer reviews in the context of customers' services. The proposed methodology, named SeMi-supervised clustering for Assessment of Reviews using Topic and Sentiment (SMARTS) for Topic-Community Representation with Semantic Networks, combines semantic and sentiment analysis of words to derive topics related to positive and negative reviews of specific services. To achieve this, a semantic network of words is constructed based on word embedding semantic similarity to identify relationships between words used in the reviews. The resulting network is then used to derive the topics present in users' reviews, which are grouped by positive and negative sentiment based on words related to specific services. Clusters of words, obtained from the network's communities, are used to extract topics related to particular services and to improve the interpretation of users' assessments of those services. The proposed methodology is applied to tourism review data from Booking.com, and the results demonstrate the efficacy of the approach in enhancing the interpretability of the topics obtained by semi-supervised clustering. The methodology has the potential to provide valuable insights into the sentiment of customers toward tourism services, which could be utilized by service providers and decision-makers to enhance the quality of their services.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100474"},"PeriodicalIF":3.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2214579624000509/pdfft?md5=46a689f4478007ad8db7233af95c8c2e&pid=1-s2.0-S2214579624000509-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141401445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-13 | DOI: 10.1016/j.bdr.2024.100477
Huimin Han , Bouba oumarou Aboubakar , Mughair Bhatti , Bandeh Ali Talpur , Yasser A. Ali , Muna Al-Razgan , Yazeed Yasid Ghadi
This study presents a comprehensive evaluation of two prominent deep learning models, Vision Transformer (ViT) and VGG16, within the domain of image captioning for remote sensing data. By leveraging the BLEU score, a widely accepted metric for assessing the quality of text generated by machine learning models against a set of reference captions, this research aims to dissect and understand the capabilities and performance nuances of these models across various sample sizes: 25, 50, 75, and 100 samples. Our findings reveal that the Vision Transformer model generally outperforms the VGG16 model across all evaluated sample sizes, achieving its peak performance at 50 samples with a BLEU score of 0.5507. This performance shows that ViT benefits from its ability to capture global dependencies within the data, providing a more nuanced understanding of the images. However, the performance slightly decreases as the sample size increases beyond 50, indicating potential challenges in scalability or overfitting to the training data. Conversely, the VGG16 model shows a different performance trajectory, starting with a lower BLEU score for smaller sample sizes but demonstrating a consistent improvement as the sample size increases, culminating in its highest BLEU score of 0.4783 for 100 samples. This pattern suggests that VGG16 may require a larger dataset to adequately learn and generalize from the data, although it achieves a more modest performance ceiling compared to ViT. Through a detailed analysis of these findings, the study underscores the strengths and limitations of each model in the context of image captioning. The Vision Transformer's superior performance highlights its potential for applications requiring high accuracy in text generation from images. In contrast, the gradual improvement exhibited by VGG16 suggests its utility in scenarios where large datasets are available, and scalability is a priority. This study contributes to the ongoing discourse in the AI community regarding the selection and optimization of deep learning models for complex tasks such as image captioning, offering insights that could guide future research and application development in this field.
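For readers unfamiliar with the metric, BLEU scores a model caption against reference captions via n-gram overlap. Below is a hedged example with invented captions, using NLTK's corpus-level BLEU with smoothing, which is useful because short remote-sensing captions often lack higher-order n-gram matches.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical captions: each reference entry is a list of tokenized
# ground-truth captions; each hypothesis is one tokenized model output
# (e.g., from the ViT or VGG16 captioning pipeline).
references = [[["a", "river", "crosses", "farmland"]],
              [["an", "airport", "with", "two", "runways"]]]
hypotheses = [["a", "river", "through", "farmland"],
              ["an", "airport", "with", "runways"]]

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(round(score, 4))
```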
{"title":"Optimizing image captioning: The effectiveness of vision transformers and VGG networks for remote sensing","authors":"Huimin Han , Bouba oumarou Aboubakar , Mughair Bhatti , Bandeh Ali Talpur , Yasser A. Ali , Muna Al-Razgan , Yazeed Yasid Ghadi","doi":"10.1016/j.bdr.2024.100477","DOIUrl":"10.1016/j.bdr.2024.100477","url":null,"abstract":"<div><p>This study presents a comprehensive evaluation of two prominent deep learning models, Vision Transformer (ViT) and VGG16, within the domain of image captioning for remote sensing data. By leveraging the BLEU score, a widely accepted metric for assessing the quality of text generated by machine learning models against a set of reference captions, this research aims to dissect and understand the capabilities and performance nuances of these models across various sample sizes: 25, 50, 75, and 100 samples. Our findings reveal that the Vision Transformer model generally outperforms the VGG16 model across all evaluated sample sizes, achieving its peak performance at 50 samples with a BLEU score of 0.5507. This performance shows that ViT benefits from its ability to capture global dependencies within the data, providing a more nuanced understanding of the images. However, the performance slightly decreases as the sample size increases beyond 50, indicating potential challenges in scalability or overfitting to the training data. Conversely, the VGG16 model shows a different performance trajectory, starting with a lower BLEU score for smaller sample sizes but demonstrating a consistent improvement as the sample size increases, culminating in its highest BLEU score of 0.4783 for 100 samples. This pattern suggests that VGG16 may require a larger dataset to adequately learn and generalize from the data, although it achieves a more modest performance ceiling compared to ViT. Through a detailed analysis of these findings, the study underscores the strengths and limitations of each model in the context of image captioning. The Vision Transformer's superior performance highlights its potential for applications requiring high accuracy in text generation from images. In contrast, the gradual improvement exhibited by VGG16 suggests its utility in scenarios where large datasets are available, and scalability is a priority. This study contributes to the ongoing discourse in the AI community regarding the selection and optimization of deep learning models for complex tasks such as image captioning, offering insights that could guide future research and application development in this field.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100477"},"PeriodicalIF":3.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141415449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-13 | DOI: 10.1016/j.bdr.2024.100476
With the increasingly severe global carbon emissions problem and the serious threat ecosystems face, carbon neutrality has gradually attracted widespread attention. This study provides an in-depth analysis of practical cases of international carbon neutrality initiatives and relevant experiences of marine cities, focusing on the construction and implementation of a legal system for economic, ecologically coordinated compensation. To evaluate the actual effectiveness of the legal system in marine cities, this study used a multiple linear regression model, considering factors such as the strictness of the legal system, enforcement efforts, and the level of participation of local enterprises and residents. The research results indicate that carbon emissions have significantly decreased in cities where legal systems are effectively enforced, from an average of 1.5 million tons per year to 1 million tons. At the same time, the economic growth rate of these cities has also significantly improved, increasing by about 2.5 percentage points from the original annual average of 4 % to 6.5 %. The study also found that the biodiversity index of these cities increased by 15 %, far higher than the average increase of 5 % in other cities, indicating the positive role of legal systems in protecting biodiversity. The public's participation rate in environmental protection activities has also increased from 25 % to 45 %, and the growth rate of green investment has reached an average of 8 % per year, far exceeding the 3 % growth rate of other cities. In terms of the ecosystem, data shows that the distribution of the ecosystem is stable, with an average ecological index of 508, which is in a relatively ideal state. The annual average growth rate of ecosystem restoration is about 3.5 %, further proving the effectiveness of ecological protection measures. Comprehensive empirical analysis shows that implementing the new legal system effectively reduces carbon emissions, enhances biodiversity, and promotes sustainable economic development. The economic growth rate increased from an average of 4.2 % to 5.1 % per year after implementing the new legal system, fully demonstrating the important role of the economic, ecologically coordinated compensation legal system in promoting carbon neutrality goals in marine cities.
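A schematic of the multiple linear regression setup described above, with synthetic stand-ins for the named factors (legal strictness, enforcement effort, participation) regressed against emissions. The data, coefficients, and variable names are illustrative only, not the study's panel.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical city-level data mirroring the factors named in the abstract;
# values and effect sizes are synthetic placeholders.
rng = np.random.default_rng(1)
n = 60
strictness = rng.uniform(0, 1, n)
enforcement = rng.uniform(0, 1, n)
participation = rng.uniform(0, 1, n)
emissions = (1.5 - 0.3 * strictness - 0.2 * enforcement
             - 0.1 * participation + rng.normal(0, 0.05, n))

# Ordinary least squares with an intercept; coefficient signs indicate
# whether a factor is associated with lower emissions.
X = sm.add_constant(np.column_stack([strictness, enforcement, participation]))
model = sm.OLS(emissions, X).fit()
print(model.summary())
```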
{"title":"Research on the legal system of economic-ecological synergistic compensation in carbon neutral marine cities with a background in big data","authors":"","doi":"10.1016/j.bdr.2024.100476","DOIUrl":"10.1016/j.bdr.2024.100476","url":null,"abstract":"<div><p>With the increasingly severe global carbon emissions problem and the serious threat ecosystems face, carbon neutrality has gradually attracted widespread attention. This study provides an in-depth analysis of practical cases of international carbon neutrality initiatives and relevant experiences of marine cities, focusing on the construction and implementation of a legal system for economic, ecologically coordinated compensation. To evaluate the actual effectiveness of the legal system in marine cities, this study used a multiple linear regression model, considering factors such as the strictness of the legal system, enforcement efforts, and the level of participation of local enterprises and residents. The research results indicate that carbon emissions have significantly decreased in cities where legal systems are effectively enforced, from an average of 1.5 million tons per year to 1 million tons. At the same time, the economic growth rate of these cities has also significantly improved, increasing by about 2.5 percentage points from the original annual average of 4 % to 6.5 %. The study also found that the biodiversity index of these cities increased by 15 %, far higher than the average increase of 5 % in other cities, indicating the positive role of legal systems in protecting biodiversity. The public's participation rate in environmental protection activities has also increased from 25 % to 45 %, and the growth rate of green investment has reached an average of 8 % per year, far exceeding the 3 % growth rate of other cities. In terms of the ecosystem, data shows that the distribution of the ecosystem is stable, with an average ecological index of 508, which is in a relatively ideal state. The annual average growth rate of ecosystem restoration is about 3.5 %, further proving the effectiveness of ecological protection measures. Comprehensive empirical analysis shows that implementing the new legal system effectively reduces carbon emissions, enhances biodiversity, and promotes sustainable economic development. The economic growth rate increased from an average of 4.2 % to 5.1 % per year after implementing the new legal system, fully demonstrating the important role of the economic, ecologically coordinated compensation legal system in promoting carbon neutrality goals in marine cities.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100476"},"PeriodicalIF":3.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141412546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-14 | DOI: 10.1016/j.bdr.2024.100464
Aissam Aouar , Saïd Yahiaoui , Lamia Sadeg , Kadda Beghdad Bey
Typically, graph pattern matching is expressed in terms of subgraph isomorphism. Graph simulation and its variants were introduced to reduce the time complexity and obtain more meaningful results in big graphs. Among these models, the matching subgraphs obtained by tight simulation are more compact and topologically closer to the pattern graph than results produced by other approaches. However, the number of resulting subgraphs can be huge, they may overlap each other, and they are unequally relaxed from the pattern graph. Hence, we introduce a ranking and diversification method for tight simulation results, which allows the user to obtain the most diversified and relevant matching subgraphs. The approach exploits the edge weights of the big graph to express the interest of a matching subgraph obtained by tight simulation. Furthermore, we provide distributed, scalable algorithms to evaluate the proposed methods based on distributed programming paradigms. Experiments on real data graphs demonstrate the effectiveness of the proposed models and the efficiency of the associated algorithms. Result diversification reached 123% within a time frame not exceeding, on average, 40% of the duration required for tight simulation graph pattern matching.
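The diversification idea can be illustrated with a greedy selection in the style of maximal marginal relevance: each candidate match carries a relevance score (for instance, its total edge weight) and a vertex set, and a candidate's gain is discounted by its overlap with matches already chosen. This is a generic sketch under those assumptions, not the paper's exact objective.

```python
def diversified_top_k(candidates, k, lam=0.5):
    """Greedy top-k selection balancing relevance and diversity.

    candidates: list of (score, node_set) pairs, where score could be the
    total edge weight of a tight-simulation match and node_set its vertices.
    lam trades relevance off against Jaccard overlap with selected matches.
    """
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def gain(c):
            score, nodes = c
            overlap = max((len(nodes & s) / len(nodes | s)
                           for _, s in selected), default=0.0)
            return lam * score - (1 - lam) * overlap
        best = max(pool, key=gain)
        selected.append(best)
        pool.remove(best)
    return selected

# The second pick favors the disjoint, lower-scored match over the
# overlapping, higher-scored one: diversification in action.
matches = [(0.9, {1, 2, 3}), (0.85, {2, 3, 4}), (0.6, {7, 8, 9})]
print(diversified_top_k(matches, k=2))
```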
{"title":"Scalable Diversified Top-k Pattern Matching in Big Graphs","authors":"Aissam Aouar , Saïd Yahiaoui , Lamia Sadeg , Kadda Beghdad Bey","doi":"10.1016/j.bdr.2024.100464","DOIUrl":"10.1016/j.bdr.2024.100464","url":null,"abstract":"<div><p>Typically, graph pattern matching is expressed in terms of subgraph isomorphism. Graph simulation and its variants were introduced to reduce the time complexity and obtain more meaningful results in big graphs. Among these models, the matching subgraphs obtained by tight simulation are more compact and topologically closer to the pattern graph than results produced by other approaches. However, the number of resulting subgraphs can be huge, overlapping each other and unequally relaxed from the pattern graph. Hence, we introduce a ranking and diversification method for tight simulation results, which allows the user to obtain the most diversified and relevant matching subgraphs. This approach exploits the weights on edges of the big graph to express the interest of the matching subgraph by tight simulation. Furthermore, we provide distributed scalable algorithms to evaluate the proposed methods based on distributed programming paradigms. The experiments on real data graphs succeed in demonstrating the effectiveness of the proposed models and the efficiency of the associated algorithms. The result diversification reached 123% within a time frame that does not exceed 40%, on average, of the duration required for tight simulation graph pattern matching.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"36 ","pages":"Article 100464"},"PeriodicalIF":3.3,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141043195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-14 | DOI: 10.1016/j.bdr.2024.100456
Paolo Mignone , Gianvito Pio , Michelangelo Ceci
Transfer learning has proved effective for building predictive models even in complex conditions with little labeled data available, by constructing a predictive model for a target domain that also uses knowledge coming from a separate domain, called the source domain. However, several existing transfer learning methods assume identical feature spaces between the source and target domains. This assumption limits the possible real-world applications of such methods, since two separate, although related, domains could be described by totally different feature spaces. Heterogeneous transfer learning methods aim to overcome this limitation, but they usually i) make other assumptions on the features, such as requiring the same number of features, ii) are generally unable to distribute the workload over multiple computational nodes, iii) cannot work in the Positive-Unlabeled (PU) learning setting, which we also consider in this study, or iv) are limited to specific application domains, i.e., they are not general-purpose methods.
In this manuscript, we present a novel distributed heterogeneous transfer learning method, implemented in Apache Spark, that overcomes all the above-mentioned limitations. Specifically, it is able to work also in the PU learning setting by resorting to a clustering-based approach, and can align totally heterogeneous feature spaces, without exploiting peculiarities of specific application domains. Moreover, our distributed approach allows us to process large source and target datasets.
Our experimental evaluation was performed in three different application domains that can benefit from transfer learning approaches, namely the reconstruction of the human gene regulatory network, the prediction of cerebral stroke in hospital patients, and the prediction of customer energy consumption in power grids. The results show that the proposed approach outperforms 4 state-of-the-art heterogeneous transfer learning approaches and 3 baselines, and exhibits ideal performance in terms of scalability.
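A single-machine sketch of the clustering-based PU idea mentioned above (the paper's method is distributed in Apache Spark and additionally aligns heterogeneous feature spaces): clusters that absorb most of the known positives are treated as reliably positive. The data, cluster count, and decision rule are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical PU setting: X holds all instances, pos_idx the few known
# positives; everything else is unlabeled.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 1, (50, 5))])
pos_idx = np.arange(50, 60)          # known positives sit in one mode

# Cluster everything, then rank clusters by the share of known positives
# they contain; the top cluster is labeled as reliably positive.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
pos_rate = np.array([np.isin(np.where(labels == c)[0], pos_idx).mean()
                     for c in range(2)])
reliable_pos = labels == pos_rate.argmax()
print(reliable_pos.sum(), "instances treated as positive")
```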
{"title":"Distributed Heterogeneous Transfer Learning","authors":"Paolo Mignone , Gianvito Pio , Michelangelo Ceci","doi":"10.1016/j.bdr.2024.100456","DOIUrl":"10.1016/j.bdr.2024.100456","url":null,"abstract":"<div><p>Transfer learning has proved to be effective for building predictive models even in complex conditions with a low amount of available labeled data, by constructing a predictive model for a target domain also using the knowledge coming from a separate domain, called source domain. However, several existing transfer learning methods assume identical feature spaces between the source and the target domains. This assumption limits the possible real-world applications of such methods, since two separate, although related, domains could be described by totally different feature spaces. Heterogeneous transfer learning methods aim to overcome this limitation, but they usually <em>i)</em> make other assumptions on the features, such as requiring the same number of features, <em>ii)</em> are not generally able to distribute the workload over multiple computational nodes, <em>iii)</em> cannot work in the Positive-Unlabeled (PU) learning setting, which we also considered in this study, or <em>iv)</em> their applicability is limited to specific application domains, i.e., they are not general-purpose methods.</p><p>In this manuscript, we present a novel distributed heterogeneous transfer learning method, implemented in Apache Spark, that overcomes all the above-mentioned limitations. Specifically, it is able to work also in the PU learning setting by resorting to a clustering-based approach, and can align totally heterogeneous feature spaces, without exploiting peculiarities of specific application domains. Moreover, our distributed approach allows us to process large source and target datasets.</p><p>Our experimental evaluation was performed in three different application domains that can benefit from transfer learning approaches, namely the reconstruction of the human gene regulatory network, the prediction of cerebral stroke in hospital patients, and the prediction of customer energy consumption in power grids. The results show that the proposed approach is able to outperform 4 state-of-the-art heterogeneous transfer learning approaches and 3 baselines, and exhibits ideal performances in terms of scalability.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100456"},"PeriodicalIF":3.3,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2214579624000327/pdfft?md5=33cf99e10874514291bfc635b26d260f&pid=1-s2.0-S2214579624000327-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141025163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-08 | DOI: 10.1016/j.bdr.2024.100463
Feiya Li , Chunyun Fu , Dongye Sun , Jian Li , Jianwen Wang
Point cloud maps generated via LiDAR sensors using extensive remotely sensed data are commonly used by autonomous vehicles and robots for localization and navigation. However, dynamic objects contained in point cloud maps not only degrade localization accuracy and navigation performance but also jeopardize map quality. In response to this challenge, we propose a novel semantic SLAM approach for dynamic scenes based on LiDAR point clouds, referred to as SD-SLAM hereafter. The main contributions of this work are threefold: 1) introducing a semantic SLAM framework dedicated to dynamic scenes based on LiDAR point clouds, 2) employing semantics and Kalman filtering to effectively differentiate between dynamic and semi-static landmarks, and 3) making full use of semi-static and purely static landmarks with semantic information in the SD-SLAM process to improve localization and mapping performance. To evaluate the proposed SD-SLAM, tests were conducted on the widely adopted KITTI odometry dataset. Results demonstrate that SD-SLAM effectively mitigates the adverse effects of dynamic objects on SLAM, improving vehicle localization and mapping performance in dynamic scenes while simultaneously constructing a static semantic map with multiple semantic classes for enhanced environment understanding.
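Contribution 2 can be sketched with a per-landmark constant-velocity Kalman filter: the filter tracks position and velocity from successive observations, and a landmark whose estimated speed stays above a threshold is flagged as dynamic rather than semi-static. The 1D state, noise levels, and threshold below are assumptions for illustration, not the paper's parameters.

```python
import numpy as np

# Constant-velocity Kalman filter for one landmark (1D for brevity):
# state is [position, velocity].
dt, q, r = 0.1, 1e-3, 0.05
F = np.array([[1, dt], [0, 1]])      # state transition
H = np.array([[1.0, 0.0]])           # we only observe position
Q = q * np.eye(2)                    # process noise
R = np.array([[r]])                  # measurement noise

x, P = np.zeros(2), np.eye(2)
for z in [0.00, 0.12, 0.21, 0.33, 0.41]:   # observed landmark positions
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # update
    y = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P

is_dynamic = abs(x[1]) > 0.5         # speed threshold (assumed)
print("velocity:", round(float(x[1]), 2), "dynamic:", is_dynamic)
```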
{"title":"SD-SLAM: A semantic SLAM approach for dynamic scenes based on LiDAR point clouds","authors":"Feiya Li , Chunyun Fu , Dongye Sun , Jian Li , Jianwen Wang","doi":"10.1016/j.bdr.2024.100463","DOIUrl":"https://doi.org/10.1016/j.bdr.2024.100463","url":null,"abstract":"<div><p>Point cloud maps generated via LiDAR sensors using extensive remotely sensed data are commonly used by autonomous vehicles and robots for localization and navigation. However, dynamic objects contained in point cloud maps not only downgrade localization accuracy and navigation performance but also jeopardize the map quality. In response to this challenge, we propose in this paper a novel semantic SLAM approach for dynamic scenes based on LiDAR point clouds, referred to as SD-SLAM hereafter. The main contributions of this work are in three aspects: 1) introducing a semantic SLAM framework dedicatedly for dynamic scenes based on LiDAR point clouds, 2) employing semantics and Kalman filtering to effectively differentiate between dynamic and semi-static landmarks, and 3) making full use of semi-static and pure static landmarks with semantic information in the SD-SLAM process to improve localization and mapping performance. To evaluate the proposed SD-SLAM, tests were conducted using the widely adopted KITTI odometry dataset. Results demonstrate that the proposed SD-SLAM effectively mitigates the adverse effects of dynamic objects on SLAM, improving vehicle localization and mapping performance in dynamic scenes, and simultaneously constructing a static semantic map with multiple semantic classes for enhanced environment understanding.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"36 ","pages":"Article 100463"},"PeriodicalIF":3.3,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141083349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-08 | DOI: 10.1016/j.bdr.2024.100465
Mughair Aslam Bhatti , M.S. Syam , Huafeng Chen , Yurong Hu , Li Wai Keung , Zeeshan Zeeshan , Yasser A. Ali , Nadia Sarhan
This study presents the implementation and evaluation of a convolutional neural network (CNN) based segmentation model using the U-Net architecture, applied to forest image segmentation. The proposed pipeline starts by preprocessing a repository of satellite images and their corresponding masks. Data preprocessing involves resizing, normalizing, and splitting the images and masks into training and testing datasets. The U-Net architecture, comprising encoder and decoder parts with skip connections, is defined and compiled with binary cross-entropy loss and the Adam optimizer. Training includes early stopping and checkpoint-saving mechanisms to prevent overfitting and retain the best model weights. Evaluation metrics such as Intersection over Union (IoU), Dice coefficient, pixel accuracy, precision, recall, specificity, and F1-score are computed to assess the model's performance. Visualization of results includes comparing predicted segmentation masks with ground-truth masks for qualitative analysis. The study emphasizes the importance of training data size in achieving accurate segmentation models and highlights the potential of the U-Net architecture for forest image segmentation tasks.
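The evaluation metrics named above are straightforward to compute from binary masks; here is a small numpy sketch (the epsilon guard against empty masks is an implementation choice, not from the paper).

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """IoU, Dice coefficient, and pixel accuracy for binary 0/1 masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    eps = 1e-7                                   # guards empty masks
    iou = inter / (union + eps)
    dice = 2 * inter / (pred.sum() + truth.sum() + eps)
    acc = (pred == truth).mean()
    return iou, dice, acc

pred = np.array([[1, 1, 0], [0, 1, 0]])
truth = np.array([[1, 0, 0], [0, 1, 1]])
print(segmentation_metrics(pred, truth))
```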
{"title":"Utilizing convolutional neural networks (CNN) and U-Net architecture for precise crop and weed segmentation in agricultural imagery: A deep learning approach","authors":"Mughair Aslam Bhatti , M.S. Syam , Huafeng Chen , Yurong Hu , Li Wai Keung , Zeeshan Zeeshan , Yasser A. Ali , Nadia Sarhan","doi":"10.1016/j.bdr.2024.100465","DOIUrl":"10.1016/j.bdr.2024.100465","url":null,"abstract":"<div><p>This study presents the implementation and evaluation of a convolutional neural network (CNN) based image segmentation model using the U-Net architecture for forest image segmentation. The proposed algorithm starts by preprocessing the datasets of satellite images and corresponding masks from a repository source. Data preprocessing involves resizing, normalizing, and splitting the images and masks into training and testing datasets. The U-Net model architecture, comprising encoder and decoder parts with skip connections, is defined and compiled with binary cross-entropy loss and Adam optimizer. Training includes early stopping and checkpoint saving mechanisms to prevent overfitting and retain the best model weights. Evaluation metrics such as Intersection over Union (IoU), Dice coefficient, pixel accuracy, precision, recall, specificity, and F1-score are computed to assess the model's performance. Visualization of results includes comparing predicted segmentation masks with ground truth masks for qualitative analysis. The study emphasizes the importance of training data size in achieving accurate segmentation models and highlights the potential of U-Net architecture for forest image segmentation tasks.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"36 ","pages":"Article 100465"},"PeriodicalIF":3.3,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141026200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-01 | DOI: 10.1016/j.bdr.2024.100461
Yanan Wu, Rong Mei, Jie Xu
This paper considers non-pilot, data-aided estimation of the carrier frequency offset (CFO) and sampling frequency offset (SFO) of orthogonal frequency division multiplexing (OFDM) signals in fast time-varying channels. The main obstacle is the time-variant channel response, which degrades estimation validity. A practical way to mitigate this impact is to reduce the time consumed by a one-shot estimate; accordingly, we propose a method that completes estimation within one OFDM symbol duration. The maximum likelihood (ML) estimator is derived from the frequency-domain constellations output by two FFTs on one symbol; a closed-form approximation is then derived to reduce the computational burden. Remarkably, our method requires no training symbol or pilot tone embedded in the signal spectrum and therefore achieves the highest spectral efficiency. Theoretical analysis and simulation results assess the performance of the proposed method against existing alternatives.
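The core mechanism, recovering a frequency offset from the phase rotation between two FFT outputs on one symbol, can be illustrated with a classical Moose-style repeated-half construction. Note the caveats: this simplified sketch needs the repeated structure and omits the SFO term, whereas the paper's ML estimator works blindly on data constellations.

```python
import numpy as np

N = 64
rng = np.random.default_rng(3)

# Repeated-half structure: modulating only the even subcarriers makes the
# two time-domain halves of the symbol identical before any offset.
data = np.zeros(N, dtype=complex)
data[::2] = np.exp(1j * np.pi / 2 * rng.integers(0, 4, N // 2))  # QPSK
tx = np.fft.ifft(data)

eps = 0.02                                   # true CFO in subcarrier spacings
n = np.arange(N)
rx = tx * np.exp(2j * np.pi * eps * n / N)   # CFO = linear phase ramp in time

# Two FFTs over the two halves of the same received symbol.
F1 = np.fft.fft(rx[:N // 2])
F2 = np.fft.fft(rx[N // 2:])

# With identical halves, F2 = exp(j*pi*eps) * F1 on every bin, so a
# correlation across bins recovers the rotation angle and hence the CFO.
eps_hat = np.angle(np.sum(F2 * np.conj(F1))) / np.pi
print("true CFO:", eps, "estimated:", round(float(eps_hat), 4))
```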
{"title":"Non pilot data-aided carrier and sampling frequency offsets estimation in fast time-varying channel","authors":"Yanan Wu, Rong Mei, Jie Xu","doi":"10.1016/j.bdr.2024.100461","DOIUrl":"https://doi.org/10.1016/j.bdr.2024.100461","url":null,"abstract":"<div><p>This paper considers the non pilot data-aided estimation of the carrier frequency offset (CFO) and sample frequency offset (SFO) of orthogonal frequency division multiplexing (OFDM) signals in fast time-varying channel. The main obstacle is the time-variant channel response, which deteriorates the estimation validity. A practical approach to mitigate this impact is to reduce the time consumption of one-shot estimation. In this way, we propose a method to reduce the time consumption to within one OFDM symbol duration. The maximum likelihood (ML) estimator is derived based on the observations of frequency domain constellations output of two FFTs on one symbol; its closed-form approximation is then derived to reduce the calculation burden. Remarkably, our method does not require any training symbol or pilot tone embedded in the signal spectrum, therefore achieves the highest spectral efficiency. Theoretical analysis and simulation results are employed to assess the performance of proposed method in comparison with existing alternatives.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"36 ","pages":"Article 100461"},"PeriodicalIF":3.3,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140901432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}