Big Data Mining and Analytics: Latest Articles, Page 7

RF-PSSM: A Combination of Rotation Forest Algorithm and Position-Specific Scoring Matrix for Improved Prediction of Protein-Protein Interactions Between Hepatitis C Virus and Human

IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data Mining and Analytics

Pub Date : 2022-11-24 DOI: 10.26599/BDMA.2022.9020031

Xin Liu;Yaping Lu;Liang Wang;Wei Geng;Xinyi Shi;Xiao Zhang

Identifying interactions between hepatitis C virus (HCV) and human proteins will not only help us understand the molecular mechanisms of related diseases but also be conducive to discovering new drug targets. An increasing number of clinically and experimentally validated interactions between HCV and human proteins have been documented in public databases, facilitating studies based on computational methods. In this study, we propose a new computational approach, rotation forest position-specific scoring matrix (RF-PSSM), to predict interactions between HCV and human proteins. Specifically, a PSSM is used to characterize each protein, and two-dimensional principal component analysis (2DPCA) is then applied to extract features from the PSSM. Finally, rotation forest (RF) performs the classification. Ablation experiments show that, on independent datasets, the accuracy and area under the curve (AUC) of RF-PSSM reach 93.74% and 94.29%, respectively, outperforming almost all cutting-edge methods. In addition, we used RF-PSSM to predict 9 human proteins that may interact with HCV protein E1, which can provide theoretical guidance for future experimental studies.
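The feature-extraction step named in the abstract, 2DPCA, can be illustrated with a minimal sketch. It assumes every protein has already been turned into a matrix of the same shape (the abstract does not say how variable-length PSSMs are normalized), and it uses random matrices as hypothetical stand-ins for real PSSMs. The function names `fit_2dpca` and `transform_2dpca` are illustrative and not taken from the paper.

```python
import numpy as np

def fit_2dpca(matrices, n_components):
    """Return the top projection axes (as columns) for equal-shaped matrices."""
    stack = np.asarray(matrices, dtype=float)      # shape (n, rows, cols)
    centered = stack - stack.mean(axis=0)
    # Image covariance G = mean_i (A_i - mean)^T (A_i - mean), shape (cols, cols)
    cov = np.einsum("nij,nik->jk", centered, centered) / len(stack)
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]

def transform_2dpca(matrix, axes):
    """Project one matrix onto the learned axes: Y = A X."""
    return np.asarray(matrix, dtype=float) @ axes

rng = np.random.default_rng(0)
# Hypothetical stand-ins for PSSMs: 30 matrices of 50 rows x 20 columns
toy_pssms = rng.normal(size=(30, 50, 20))
axes = fit_2dpca(toy_pssms, n_components=5)
features = transform_2dpca(toy_pssms[0], axes)
print(features.shape)
```

The projected matrix would then be flattened and passed to a classifier; rotation forest itself is not shown here.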

Vol. 6, No. 1, pp. 21-31. PDF: https://ieeexplore.ieee.org/iel7/8254253/9962810/09962955.pdf

Citations: 1

Recommendation System with Biclustering

IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data Mining and Analytics

Pub Date : 2022-07-18 DOI: 10.26599/BDMA.2022.9020012

Jianjun Sun;Yu Zhang

The massive growth of online commercial data has increased the demand for automatic recommender systems that benefit both users and merchants. One of the most frequently used recommendation methods is collaborative filtering, but its accuracy is limited by the sparsity of the rating dataset. Most existing collaborative filtering methods consider all features when calculating user/item similarity and ignore much local information. In collaborative filtering, selecting neighbors and determining users' similarities are the most important steps. To select better neighbors, this study proposes a novel biclustering method based on modified fuzzy adaptive resonance theory. To reflect the similarity between users, a new measure that accounts for the number of items users have in common is proposed. Specifically, the proposed biclustering method is first used to obtain local similarities and local predictions. Second, item-based collaborative filtering is used to generate global predictions. Finally, the two predictions are fused to obtain a final one. Experimental results on three benchmark datasets demonstrate that the proposed method outperforms state-of-the-art models in several respects.
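To make two of the ideas above concrete, here is a small sketch of a user similarity that is discounted when few items are shared, and of a weighted fusion of a local and a global prediction. The abstract does not give the paper's actual similarity measure or fusion rule, so the significance weighting with `min_common` and the mixing weight `alpha` are assumptions chosen for illustration only.

```python
from math import sqrt

def weighted_similarity(ratings_a, ratings_b, min_common=5):
    """Cosine similarity over co-rated items, shrunk when few items are shared."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = sqrt(sum(ratings_b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    cosine = dot / (norm_a * norm_b)
    # Assumed weighting: full trust only once min_common items are shared
    return cosine * min(len(common), min_common) / min_common

def fuse(local_pred, global_pred, alpha=0.5):
    """Blend a local (bicluster) and a global (item-based) prediction."""
    return alpha * local_pred + (1 - alpha) * global_pred

# Hypothetical users and ratings
alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 5, "m2": 3, "m3": 4, "m4": 1}
carol = {"m1": 5}

print(weighted_similarity(alice, bob))    # three shared items
print(weighted_similarity(alice, carol))  # one shared item, so a lower score
print(fuse(4.0, 3.0, alpha=0.25))
```

Both pairs agree perfectly on their shared items, yet the pair with more items in common receives the higher score, which is the effect the abstract describes.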

Vol. 5, No. 4, pp. 282-293. PDF: https://ieeexplore.ieee.org/iel7/8254253/9832761/09832768.pdf

Citations: 1

Effect of Feature Selection on the Prediction of Direct Normal Irradiance

IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data Mining and Analytics

Pub Date : 2022-07-18 DOI: 10.26599/BDMA.2022.9020003

Mohamed Khalifa Boutahir;Yousef Farhaoui;Mourade Azrour;Imad Zeroual;Ahmad El Allaoui

Solar radiation can produce heat, cause chemical reactions, or generate electricity. Thus, the amount of solar radiation at different times of the day must be determined to design and equip solar systems. Moreover, a thorough understanding of the different solar radiation components is necessary, such as Direct Normal Irradiance (DNI), Diffuse Horizontal Irradiance (DHI), and Global Horizontal Irradiance (GHI). Unfortunately, measurements of solar radiation are not easily accessible for most regions of the globe. This paper aims to develop a set of deep learning models, using feature importance algorithms, to predict DNI. The proposed models are based on historical data of meteorological parameters and solar radiation properties, recorded at 60-minute intervals at a specific location in the region of Errachidia, Morocco, from January 1, 2017, to December 31, 2019. The findings demonstrate that feature selection approaches play a crucial role in accurately forecasting solar radiation when predictions are compared with the available data.
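As a rough illustration of what a feature-selection step does before a model is trained, the sketch below ranks candidate inputs by the absolute Pearson correlation with the target and keeps the top k. This simple filter is a stand-in, not the feature importance algorithms used in the paper, and the column names and values are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def select_features(columns, target, k):
    """Rank features by |correlation with target| and keep the k strongest."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Hypothetical hourly records; the numbers are invented for illustration
columns = {
    "ghi": [100, 300, 500, 700, 900],
    "temperature": [12, 15, 21, 24, 26],
    "wind_speed": [3, 1, 4, 1, 5],
}
dni = [80, 260, 470, 640, 850]

selected, scores = select_features(columns, dni, k=2)
print(selected)
```

Only the selected columns would then be fed to the forecasting model, which is not shown here.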

Vol. 5, No. 4, pp. 309-317. PDF: https://ieeexplore.ieee.org/iel7/8254253/9832761/09832772.pdf

Citations: 3

Total Contents

IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data Mining and Analytics

Pub Date : 2022-07-18

Citations: 0

$\tau\text{JOWL}$: A Systematic Approach to Build and Evolve a Temporal OWL 2 Ontology Based on Temporal JSON Big Data

IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data Mining and Analytics

Pub Date : 2022-07-18 DOI: 10.26599/BDMA.2021.9020019

Zouhaier Brahmia;Fabio Grandi;Rafik Bouaziz

Ontologies defined in the OWL 2 Web Ontology Language (OWL 2) are now used in several fields, such as artificial intelligence, knowledge engineering, and Semantic Web environments, to access data, answer queries, or infer new knowledge. In particular, ontologies can be used to model the semantics of big data as an enabling factor for the deployment of intelligent analytics. Big data are widely stored and exchanged in JavaScript Object Notation (JSON) format, in particular by Web applications. However, JSON data collections lack explicit semantics because they are generally schema-less, which makes it hard to leverage the benefits of big data efficiently. Furthermore, several applications require bookkeeping of the entire history of big data changes, for which mainstream big data management systems, including Not only SQL (NoSQL) database systems, provide no support. In this paper, we propose an approach, named $\tau\text{JOWL}$ (temporal OWL 2 from temporal JSON), which allows users (i) to automatically build a temporal OWL 2 ontology of data, following the Closed World Assumption (CWA), from temporal JSON-based big data, and (ii) to manage its incremental maintenance, accommodating the evolution of these data, in a temporal and multi-schema environment.
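The "bookkeeping of the entire history" that the abstract refers to can be pictured with a small sketch that keeps every version of a JSON document together with a validity interval and answers "what did the data look like on a given date". The record layout (`valid_from`, `valid_to`, `data`) and the helper names are hypothetical; they are not the temporal JSON format or the ontology-building procedure of the paper, and no OWL 2 is generated here.

```python
import json

FOREVER = "9999-12-31"  # assumed marker for "still valid"

def add_version(history, data, valid_from):
    """Close the current version at valid_from and append the new one."""
    if history:
        history[-1]["valid_to"] = valid_from
    history.append({"valid_from": valid_from, "valid_to": FOREVER, "data": data})

def snapshot(history, date):
    """Return the data valid on the given ISO date, or None if none was."""
    for version in history:
        # ISO dates compare correctly as strings
        if version["valid_from"] <= date < version["valid_to"]:
            return version["data"]
    return None

history = []
add_version(history, {"name": "sensor-1", "unit": "C"}, "2021-01-01")
# A later change that also alters the implicit schema (new key "site")
add_version(history, {"name": "sensor-1", "unit": "K", "site": "lab"}, "2021-06-01")

print(json.dumps(history, indent=2))
print(snapshot(history, "2021-03-15"))
```

Nothing is overwritten: the first version stays queryable after the second is added, and the two versions do not share the same set of keys, which is the multi-schema situation the abstract mentions.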

Vol. 5, No. 4, pp. 271-281. PDF: https://ieeexplore.ieee.org/iel7/8254253/9832761/09832769.pdf

Citations: 0

Application of Internet of Things in the Health Sector: Toward Minimizing Energy Consumption

IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data Mining and Analytics

Pub Date : 2022-07-18 DOI: 10.26599/BDMA.2021.9020031

Mohammed Moutaib;Tarik Ahajjam;Mohammed Fattah;Yousef Farhaoui;Badraddine Aghoutane;Moulhime El Bekkali

The Internet of Things (IoT) is currently reflected in the growing number of connected objects, that is, devices with their own identity and their own computing and communication capacities. IoT is recognized as one of the most critical areas for future technologies and is gaining worldwide attention. It has been applied successfully in many areas, such as healthcare, where a patient is monitored using nodes and lightweight sensors. However, the powerful functions of IoT in the medical field rely on autonomous communication, analysis, processing, and management of data without any manual intervention, which presents many difficulties, such as energy consumption. These issues significantly slow the development and rapid deployment of this technology. The main causes of wasted energy in connected objects include collisions, which occur when two or more nodes send data simultaneously, and data retransmissions, which occur after a collision or when data are not received correctly because of channel fading. The distance between nodes is another factor influencing energy consumption. In this article, we propose direct communication between nodes to avoid collision domains, which helps reduce data retransmission. The results show that the distributed approach can ensure system performance under general conditions, compared with the centralized approach and with existing work.
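The link between collisions, retransmissions, and wasted energy can be made concrete with a back-of-the-envelope model. The sketch below assumes a slotted shared channel in which each node transmits independently with the same probability, and it ignores channel fading and distance; this simplified model and its function names are illustrative assumptions, not the simulation setup of the paper.

```python
def collision_probability(n_nodes, p_transmit):
    """Chance that a transmission overlaps with at least one other node's."""
    return 1.0 - (1.0 - p_transmit) ** (n_nodes - 1)

def expected_energy_per_packet(energy_per_attempt, p_success):
    """Mean energy when a packet is resent until it gets through (geometric retries)."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return energy_per_attempt / p_success

# Ten nodes sharing one collision domain, each sending in 10% of the slots
p_collide = collision_probability(n_nodes=10, p_transmit=0.1)
energy = expected_energy_per_packet(energy_per_attempt=1.0, p_success=1.0 - p_collide)
print(round(p_collide, 3), round(energy, 2))

# A node alone in its collision domain never collides
print(collision_probability(n_nodes=1, p_transmit=0.5))
```

Under these assumptions, shrinking the collision domain lowers the collision probability and therefore the expected number of attempts per delivered packet, which is the intuition behind avoiding shared collision domains.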

Vol. 5, No. 4, pp. 302-308. PDF: https://ieeexplore.ieee.org/iel7/8254253/9832761/09832765.pdf

Citations: 1

Predicting Students' Final Performance Using Artificial Neural Networks

IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data Mining and Analytics

Pub Date : 2022-07-18 DOI: 10.26599/BDMA.2021.9020030

Tarik Ahajjam;Mohammed Moutaib;Haidar Aissa;Mourad Azrour;Yousef Farhaoui;Mohammed Fattah

Artificial Intelligence (AI) is based on algorithms that allow machines to make decisions for humans. This technology enhances users' experience in various ways. Several studies in the field of education have used various Machine Learning (ML) algorithms to address the problem of student orientation and performance. The main goal of this article is to predict the performance of Moroccan students in the Guelmim Oued Noun region using an intelligent system based on neural networks, one of the best data mining techniques and the one that gave us the best results.

Citations: 2

Influencing Factors and Clustering Characteristics of COVID-19: A Global Analysis

IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data Mining and Analytics

Pub Date : 2022-07-18 DOI: 10.26599/BDMA.2022.9020010

Tianlong Zheng;Chunli Zhang;Yueting Shi;Debao Chen;Sheng Liu

The unprecedented coronavirus disease 2019 (COVID-19) pandemic is still raging (as of 2021) in many countries worldwide. Various response strategies that study the characteristics and distribution of the virus in different regions of the world have been developed to assist in the prevention and control of this epidemic. In this study, descriptive statistics and regression analysis were applied to COVID-19 data from different countries to compare and evaluate various regression models. Results showed that the extreme random forest regression (ERFR) model performed best, and that factors such as population density, ozone, median age, life expectancy, and Human Development Index (HDI) were relatively influential on the spread of COVID-19 in the ERFR model. In addition, the clustering characteristics of the epidemic were analyzed with the spectral clustering algorithm. The visualization of the spectral clustering results showed that the geographical distribution of the global spread of COVID-19 was highly clustered, and that its clustering characteristics and influencing factors were also somewhat consistent in their distribution. This study aims to deepen the international community's understanding of the global COVID-19 pandemic, so that countries worldwide can develop measures to mitigate potential large-scale outbreaks and improve their ability to respond to such public health emergencies.
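For readers unfamiliar with spectral clustering, the following sketch shows its simplest form: build a similarity graph, form the graph Laplacian, and split the points by the sign of the second eigenvector (the Fiedler vector). It is an unnormalized two-cluster variant with a Gaussian affinity and an assumed bandwidth `sigma`, applied to made-up one-dimensional values; the abstract does not specify which variant or parameters the study used.

```python
import numpy as np

def spectral_bipartition(points, sigma=0.5):
    """Split points into two groups using the sign of the Fiedler vector."""
    x = np.asarray(points, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                      # treat a flat list as 1-D samples
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    affinity = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(affinity, 0.0)         # no self-loops
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                 # second-smallest eigenvalue
    return (fiedler > 0).astype(int)

# Hypothetical indicator values for six regions, forming two obvious groups
values = [0.0, 0.1, 0.2, 2.0, 2.1, 2.2]
labels = spectral_bipartition(values)
print(labels)
```

Which group receives label 0 or 1 is arbitrary, because the sign of an eigenvector is not fixed; only the grouping itself is meaningful.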

Vol. 5, No. 4, pp. 318-338. PDF: https://ieeexplore.ieee.org/iel7/8254253/9832761/09832767.pdf

Citations: 2

Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified Subsampling

IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data Mining and Analytics

Pub Date : 2022-06-09 DOI: 10.26599/BDMA.2021.9020032

Samin Poudel;Marwan Bikdash

Dropping fractions of users or items judiciously can reduce the computational cost of Collaborative Filtering (CF) algorithms. The effect of this subsampling on the computing time and accuracy of CF is not fully understood, and clear guidelines for selecting optimal or even appropriate subsampling levels are not available. In this paper, we present a Density-based Random Stratified Subsampling using Clustering (DRSC) algorithm in which the desired Fraction of Users Dropped (FUD) and Fraction of Items Dropped (FID) are specified, and the overall density is maintained during subsampling. Subsequently, we develop simple models of the Training Time Improvement (TTI) and the Accuracy Loss (AL) as functions of FUD and FID, based on extensive simulations of seven standard CF algorithms applied to various primary matrices from MovieLens, Yahoo Music Rating, and Amazon Automotive data. Simulations show that both TTI and a scaled AL are bilinear in FID and FUD for all seven methods. The TTI linear regression of a CF method appears to be the same for all datasets. Extensive simulations illustrate that TTI can be estimated reliably from FUD and FID alone, but AL requires considering additional dataset characteristics. The derived models are then used to optimize the subsampling levels, addressing the tradeoff between TTI and AL. A simple sub-optimal approximation was found, in which the optimal AL is proportional to the optimal Training Time Reduction Factor (TTRF) for higher values of TTRF, and the optimal subsampling levels, such as the optimal FID/(1 - FID), are proportional to the square root of TTRF.
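To show what FUD and FID control, the sketch below drops random fractions of users and items from a toy rating matrix and reports the density before and after. It uses plain uniform sampling, so it does not guarantee that density is preserved; the stratified, clustering-based selection that DRSC uses for that purpose is not reproduced here, and the helper names are hypothetical.

```python
import random

def subsample(ratings, fud, fid, seed=0):
    """Drop random fractions of users (fud) and items (fid) from {(user, item): rating}."""
    rng = random.Random(seed)
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    kept_users = set(rng.sample(users, round(len(users) * (1 - fud))))
    kept_items = set(rng.sample(items, round(len(items) * (1 - fid))))
    kept = {(u, i): r for (u, i), r in ratings.items()
            if u in kept_users and i in kept_items}
    return kept, len(kept_users), len(kept_items)

def density(n_ratings, n_users, n_items):
    """Fraction of the user-item matrix that holds a rating."""
    return n_ratings / (n_users * n_items) if n_users and n_items else 0.0

# Toy matrix: 20 users x 10 items, roughly every third cell rated
ratings = {(u, i): 1 + (u + i) % 5
           for u in range(20) for i in range(10) if (u + i) % 3 == 0}

kept, n_users, n_items = subsample(ratings, fud=0.5, fid=0.2)
print("before:", density(len(ratings), 20, 10))
print("after: ", density(len(kept), n_users, n_items))
```

Comparing the two printed densities for different seeds gives a feel for why a density-aware selection is needed when the subsampled matrix should remain representative of the original.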

Vol. 5, No. 3, pp. 192-205. PDF: https://ieeexplore.ieee.org/iel7/8254253/9793354/09793360.pdf

Citations: 9

Deep Feature Learning for Intrinsic Signature Based Camera Discrimination

IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data Mining and Analytics

Pub Date : 2022-06-09 DOI: 10.26599/BDMA.2022.9020006

Chaity Banerjee;Tharun Kumar Doppalapudi;Eduardo Pasiliao;Tathagata Mukherjee

In this paper we consider the problem of “end-to-end” digital camera identification using sequences of images obtained from the cameras. Identifying a digital camera is harder than identifying its analog counterpart, since analog-to-digital conversion smooths out the intrinsic noise in the analog signal. However, it is known that a digital camera can be identified by analyzing the intrinsic sensor artifacts that it introduces into images and videos during capture. Such methods are also known to be computationally intensive, requiring expensive pre-processing steps. We propose an end-to-end deep feature learning framework for identifying cameras from the images they produce. We conduct experiments on three custom datasets: the first covers two cameras in an indoor environment, where each camera may observe different scenes with no overlapping features; the second contains images from four cameras in an outdoor setting, where each camera observes scenes with overlapping features; and the third contains images from two cameras observing the same checkerboard pattern in an indoor setting. Our results show that deep feature representations in an end-to-end framework can capture the intrinsic hardware signature of the cameras. These deep feature maps can in turn be used to disambiguate the cameras from one another. Our system is end-to-end and requires no complicated pre-processing steps, and the trained model is computationally efficient during testing, paving the way for near-instantaneous decisions on digital camera identification in production environments. Finally, we present comparisons against the current state of the art in digital camera identification, which clearly establish the superiority of the end-to-end solution.
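
The abstract above notes that a camera can be identified from the sensor artifacts it leaves in its images. The sketch below illustrates that general idea with a simple noise-residual fingerprint and correlation matching. It is not the authors' deep feature learning framework, and it is far cruder than published sensor-noise methods. Everything in it is an assumption made for illustration: the 3x3 mean filter used as a denoiser, the synthetic 16x16 images, the noise levels, and the names `noise_residual`, `identify`, `make_pattern`, and `shoot`.

```python
import random


def noise_residual(img):
    """Interior pixels minus their 3x3 local mean (a crude denoiser)."""
    h, w = len(img), len(img[0])
    res = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            local = sum(img[r + dr][c + dc]
                        for dr in (-1, 0, 1) for dc in (-1, 0, 1)) / 9.0
            res.append(img[r][c] - local)
    return res


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def correlation(a, b):
    """Pearson correlation of two equal-length vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = (sum(x * x for x in da) * sum(y * y for y in db)) ** 0.5
    return num / den


def identify(img, fingerprints):
    """Return the camera whose fingerprint best matches the image residual."""
    res = noise_residual(img)
    return max(fingerprints,
               key=lambda name: correlation(res, fingerprints[name]))


# --- Synthetic demo: every constant below is hypothetical ---
rng = random.Random(0)
SIZE = 16


def make_pattern():
    """Fixed per-pixel sensor pattern for one simulated camera."""
    return [[rng.gauss(0.0, 2.0) for _ in range(SIZE)] for _ in range(SIZE)]


def shoot(pattern):
    """Simulate one image: smooth scene + sensor pattern + shot noise."""
    base = rng.uniform(50, 200)
    gx, gy = rng.uniform(-1, 1), rng.uniform(-1, 1)
    return [[base + gx * c + gy * r + pattern[r][c] + rng.gauss(0.0, 0.5)
             for c in range(SIZE)] for r in range(SIZE)]


patterns = {"cam_a": make_pattern(), "cam_b": make_pattern()}

# Fingerprint = average residual over several training images per camera.
fingerprints = {
    name: mean_vector([noise_residual(shoot(p)) for _ in range(10)])
    for name, p in patterns.items()
}

truth = [name for name in patterns for _ in range(5)]
predicted = [identify(shoot(patterns[name]), fingerprints)
             for name in patterns for _ in range(5)]
print(list(zip(truth, predicted)))
```

Real scenes contain texture that leaks into the residual, which is one reason practical fingerprint methods need heavier pre-processing than this toy denoiser; the abstract cites that cost as motivation for learning features end to end.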

Volume 5, Issue 3, pp. 206-227 PDF: https://ieeexplore.ieee.org/iel7/8254253/9793354/09793358.pdf

Citations: 0