
Big Data: Latest Articles

Special Issue: Big Scientific Data and Machine Learning in Science and Engineering.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-08-01 Epub Date: 2024-07-31 DOI: 10.1089/big.2024.59218.kpa
Farhad Pourkamali-Anaraki
Citations: 0
A Unified Training Process for Fake News Detection Based on Finetuned Bidirectional Encoder Representation from Transformers Model.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-08-01 Epub Date: 2023-03-22 DOI: 10.1089/big.2022.0050
Vijay Srinivas Tida, Sonya Hsu, Xiali Hei

An efficient fake news detector becomes essential as social media platforms rapidly become more accessible. Previous studies mainly designed models on individual data sets, and such models may suffer from degraded performance. Developing a robust model for a combined data set with diverse knowledge therefore becomes crucial. However, designing a model on a combined data set requires extensive training time and a sequential workload to reach optimal performance without prior knowledge of the model's parameters. This study addresses these issues by introducing a unified training strategy that uses a pretrained transformer model to provide a base structure for the classifier and all hyperparameters from the individual models. The performance of the proposed model is evaluated on three publicly available data sets, namely ISOT and others from the Kaggle website. The results indicate that the proposed unified training strategy surpassed existing models such as Random Forests, convolutional neural networks, and long short-term memory, achieving 97% accuracy and an F1 score of 0.97. Furthermore, removing words shorter than three letters from the input samples reduced training time by a factor of almost 1.5 to 1.8. We also carried out an extensive performance analysis by varying the number of encoder blocks to build compact models trained on the combined data set. The results show that reducing the number of encoder blocks lowered performance.
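The preprocessing step mentioned in the abstract (dropping words with fewer than three letters) can be sketched as follows. This is only an illustration: the function name `drop_short_words`, the whitespace tokenization, and the letter-counting rule are assumptions, not the authors' pipeline.

```python
import re


def drop_short_words(text: str, min_len: int = 3) -> str:
    """Keep whitespace-separated tokens that contain at least `min_len` letters.

    Assumption: "letters" means ASCII letters, so punctuation and digits
    attached to a token do not count toward its length.
    """
    kept = []
    for token in text.split():
        n_letters = len(re.sub(r"[^A-Za-z]", "", token))
        if n_letters >= min_len:
            kept.append(token)
    return " ".join(kept)


sample = "It is a fact that the news was fake"
filtered = drop_short_words(sample)
print(filtered)
```

Shorter inputs mean fewer tokens per sample for the transformer to process, which is consistent with the reduction in training time the abstract reports.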

Citations: 0
A New Filter Approach Based on Effective Ranges for Classification of Gene Expression Data.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-08-01 Epub Date: 2023-09-04 DOI: 10.1089/big.2022.0086
Derya Turfan, Bulent Altunkaynak, Özgür Yeniay

Over the years, many studies have been carried out to reduce and eliminate the effects of diseases on human health. Gene expression data sets play a critical role in diagnosing and treating diseases. These data sets consist of thousands of genes but only a small number of samples. This creates the curse of dimensionality and makes such data sets difficult to analyze. One of the most effective strategies for this problem is feature selection, a preprocessing step that improves classification accuracy by selecting the most relevant and informative features. In this article, we propose a new statistically based filter method for feature selection named the Effective Range-based Feature Selection Algorithm (FSAER). As an extension of the earlier Effective Range based Gene Selection (ERGS) and Improved Feature Selection based on Effective Range (IFSER) algorithms, our method combines the advantages of both while taking the disjoint area into account. To illustrate the efficacy of the proposed algorithm, experiments were conducted on six benchmark gene expression data sets. FSAER was compared with other filter methods in terms of classification accuracy to demonstrate its effectiveness. Support vector machines, the naive Bayes classifier, and the k-nearest neighbor algorithm were used as classifiers.
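To make the idea of an effective-range filter concrete, here is a minimal sketch that scores one feature by how much the per-class ranges overlap. It is not FSAER, ERGS, or IFSER as published: the range definition (mean plus or minus `gamma` standard deviations), the default `gamma`, the pairwise overlap sum, and the toy values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def overlap_ratio(values_by_class, gamma=1.732):
    """Overlap of per-class ranges for a single feature, relative to total span.

    Each class gets the range [mean - gamma*sd, mean + gamma*sd].
    A smaller ratio means the classes are better separated on this feature.
    """
    ranges = []
    for values in values_by_class.values():
        mu, sd = mean(values), pstdev(values)
        ranges.append((mu - gamma * sd, mu + gamma * sd))

    overlap = 0.0
    for i in range(len(ranges)):
        for j in range(i + 1, len(ranges)):
            lo = max(ranges[i][0], ranges[j][0])
            hi = min(ranges[i][1], ranges[j][1])
            overlap += max(0.0, hi - lo)

    span = max(r[1] for r in ranges) - min(r[0] for r in ranges)
    # A constant feature cannot separate classes, so treat it as fully overlapping.
    return overlap / span if span > 0 else 1.0


# Toy data: two hypothetical genes measured in two classes (made-up values).
features = {
    "gene_a": {"healthy": [1.0, 1.2, 0.9, 1.1], "tumor": [5.0, 5.3, 4.8, 5.1]},
    "gene_b": {"healthy": [2.0, 3.0, 2.5, 3.5], "tumor": [2.2, 3.1, 2.7, 3.4]},
}
ranking = sorted(features, key=lambda name: overlap_ratio(features[name]))
print(ranking)
```

A filter method of this kind ranks features before any classifier is trained, so the same ranking can feed support vector machines, naive Bayes, or k-nearest neighbor.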

Citations: 0
Hybrid Generalized Regularized Extreme Learning Machine Through Gradient-Based Optimizer Model for Self-Cleansing Nondeposition with Clean Bed Mode of Sediment Transport.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-08-01 Epub Date: 2023-03-07 DOI: 10.1089/big.2022.0120
Enes Gul, Mir Jafar Sadegh Safari

Sediment transport modeling is important for minimizing sedimentation in open channels, which can lead to unexpected operating expenses. From an engineering perspective, accurate models based on the effective variables involved in flow velocity computation could provide a reliable solution for channel design. Furthermore, the validity of sediment transport models is linked to the range of data used for model development, and existing design models were established on limited data ranges. The present study therefore aimed to use all experimental data available in the literature, including recently published data sets covering an extensive range of hydraulic properties. The extreme learning machine (ELM) and generalized regularized extreme learning machine (GRELM) algorithms were implemented for the modeling, and particle swarm optimization (PSO) and the gradient-based optimizer (GBO) were then used to hybridize ELM and GRELM. GRELM-PSO and GRELM-GBO results were compared with the standalone ELM, GRELM, and existing regression models to assess their accuracy. The analysis demonstrated the robustness of the models that incorporate a channel parameter; the poor results of some existing regression models seem to be linked to disregarding this parameter. Statistical analysis of the model outcomes showed that GRELM-GBO outperformed the ELM, GRELM, GRELM-PSO, and regression models, although its margin over its GRELM-PSO counterpart was slight. The mean accuracy of GRELM-GBO was 18.5% better than that of the best regression model. These promising findings may not only encourage the practical use of the recommended algorithms for channel design but also further the application of novel ELM-based methods to other environmental problems.
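For readers unfamiliar with extreme learning machines, the sketch below shows a plain ridge-regularized ELM: hidden-layer weights are drawn at random and left untrained, and only the output weights are solved in closed form. It is not the GRELM variant or the PSO/GBO hybridization from the study; the function names, the `tanh` activation, the hyperparameter values, and the synthetic target are assumptions.

```python
import numpy as np


def train_relm(X, y, n_hidden=50, ridge=1e-3, seed=0):
    """Fit a regularized ELM: random hidden layer, ridge solve for output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never trained
    b = rng.normal(size=n_hidden)                # random hidden biases
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    # Closed-form ridge solution: (H^T H + ridge * I) beta = H^T y
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ y)
    return W, b, beta


def predict_relm(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta


# Synthetic regression problem (made-up target, not sediment transport data).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2

model = train_relm(X, y)
rmse = float(np.sqrt(np.mean((predict_relm(model, X) - y) ** 2)))
print(f"training RMSE: {rmse:.4f}")
```

In a hybrid scheme of the kind the abstract describes, a metaheuristic such as PSO or GBO would search over quantities like `ridge` or the random weights rather than leaving them fixed as done here.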

Citations: 0
Vertical and Horizontal Water Penetration Velocity Modeling in Nonhomogenous Soil Using Fast Multi-Output Relevance Vector Regression.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-08-01 Epub Date: 2023-03-14 DOI: 10.1089/big.2022.0125
Babak Vaheddoost, Shervin Rahimzadeh Arashloo, Mir Jafar Sadegh Safari

This study addresses the joint determination of horizontal and vertical water movement through a porous medium using fast multi-output relevance vector regression (FMRVR). To do this, data from experiments conducted in a Plexiglas sand box measuring 300 × 300 × 150 mm are used. A random mixture of sand with a grain size of 0.5-1 mm simulates the porous medium. The experiments use 2, 3, 7, and 12 cm walls together with injection locations 130.7, 91.3, and 51.8 mm from the upstream cutoff wall. The Cartesian coordinates of the tracer, the time interval, the length of the wall in each setup, and two dummy variables identifying the initial point are then taken as independent variables for jointly estimating the horizontal and vertical velocity of water movement in the porous medium. Multi-linear regression, random forest, and support vector regression are also used as alternatives against which the FMRVR results are compared. It was concluded that FMRVR outperforms the other models, while the uncertainty in estimating horizontal penetration is larger than that for vertical penetration.
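The input and output layout described in the abstract can be illustrated with the simplest of the baselines it names, multi-linear regression with two outputs fitted jointly. This is not FMRVR, and every number below is synthetic: the value ranges, the coefficient matrix `B_true`, and the variable names are assumptions chosen only to show the shape of the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120

# Synthetic stand-ins for the inputs named in the abstract (not the real data).
x_pos = rng.uniform(0.0, 300.0, n)                  # tracer x coordinate, mm
y_pos = rng.uniform(0.0, 150.0, n)                  # tracer y coordinate, mm
dt = rng.uniform(1.0, 60.0, n)                      # time interval, arbitrary units
wall = rng.choice([20.0, 30.0, 70.0, 120.0], n)     # wall length, mm
loc = rng.integers(0, 3, n)                         # index of the injection location
d1 = (loc == 1).astype(float)                       # dummy variable 1
d2 = (loc == 2).astype(float)                       # dummy variable 2

# Design matrix with an intercept column.
A = np.column_stack([np.ones(n), x_pos, y_pos, dt, wall, d1, d2])

# Made-up coefficients: column 0 -> horizontal velocity, column 1 -> vertical velocity.
B_true = rng.normal(size=(7, 2))
Y = A @ B_true

# One least-squares solve estimates both output columns at once.
B_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(B_hat.shape)
```

Three injection locations need only two dummy variables because the third location acts as the reference level absorbed by the intercept.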

Citations: 0
Kriging, Polynomial Chaos Expansion, and Low-Rank Approximations in Material Science and Big Data Analytics.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-08-01 Epub Date: 2023-04-24 DOI: 10.1089/big.2022.0124
Golsa Mahdavi, Mohammad Amin Hariri-Ardebili

In material science and engineering, estimating material properties and their failure modes relies on physical experiments followed by modeling and optimization. However, proper optimization is challenging and computationally expensive, mainly because of the highly nonlinear behavior of brittle materials such as concrete. This study investigates the application of surrogate models to predict the mechanical characteristics of concrete. Specifically, meta-models such as polynomial chaos expansion, Kriging, and canonical low-rank approximation are used to predict the compressive strength of two different types of concrete (collected from experimental data in the literature). Various assumptions in the surrogate models are examined, and the accuracy of each is evaluated for the problem at hand. Finally, the optimal solution is provided. This study paves the way for other applications of surrogate models in material science and engineering.
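As a rough illustration of one of the surrogate models named above, the following sketch fits a one-dimensional Kriging-style interpolator with a squared-exponential kernel. It is a simplification under stated assumptions: the mean is taken as the plain sample average rather than estimated by generalized least squares, the kernel length scale and nugget are fixed by hand, and the training function is a made-up sine curve, not concrete strength data.

```python
import numpy as np


def rbf(a, b, length):
    """Squared-exponential correlation between two 1D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)


def kriging_predict(x_train, y_train, x_new, length=0.15, nugget=1e-8):
    """Predict at x_new from training pairs, assuming a constant mean."""
    mu = y_train.mean()  # simplification: sample mean instead of a GLS estimate
    K = rbf(x_train, x_train, length) + nugget * np.eye(len(x_train))
    weights = np.linalg.solve(K, y_train - mu)
    return mu + rbf(x_new, x_train, length) @ weights


# Eight "expensive experiments" replaced by evaluations of a cheap toy function.
x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2.0 * np.pi * x_train)

x_new = np.array([0.25, 0.60])
pred = kriging_predict(x_train, y_train, x_new)
print(pred)
```

The surrogate reproduces the training observations almost exactly and interpolates between them, which is what makes it a cheap stand-in for the underlying experiment during optimization.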

Citations: 0
Research on the Influence of Information Iterative Propagation on Complex Network Structure.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-07-27 DOI: 10.1089/big.2023.0016
Yinuo Qian, Fuzhong Nian, Zheming Wang, Yabing Yao

Dynamic propagation affects how network structure changes, and different networks are affected by the iterative propagation of information to different degrees. The iterative propagation of information in a network changes the connection strength of the chain edges between nodes. Most studies on temporal networks build networks based on time characteristics, and the iterative propagation of information in the network can also reflect the time characteristics of network evolution. The change of network structure is a macro-level manifestation of time characteristics, whereas the dynamics in the network are a micro-level manifestation. How to concretely visualize the change in network structure driven by the characteristics of propagation dynamics is the focus of this article. The appearance of a chain edge is a micro-level change in network structure, and the division into communities is a macro-level change. On this basis, node participation is proposed to quantify the influence of different users on information propagation in the network, and it is simulated in different types of networks. By analyzing the iterative propagation of information, a weighted network is constructed for each of the different networks. Finally, the chain edges and community division in the network are analyzed to quantify the influence of network propagation on complex network structure.
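One way to picture how repeated propagation turns an unweighted network into a weighted one is the toy simulation below. It is a sketch under assumptions: the cascade rule, the forwarding probability `p`, the five-node graph, and the definition of participation as "number of successful forwards" are all hypothetical and are not taken from the article.

```python
import random
from collections import defaultdict


def propagate(adj, source, p, rng):
    """Run one cascade from `source`; return the edges the information crossed."""
    informed = {source}
    frontier = [source]
    used_edges = []
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                # Each informed node tries once to pass the message to each neighbor.
                if v not in informed and rng.random() < p:
                    informed.add(v)
                    next_frontier.append(v)
                    used_edges.append((u, v))
        frontier = next_frontier
    return used_edges


# Small undirected example graph as an adjacency list (made up).
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2, 4], 4: [3]}

rng = random.Random(42)
edge_weight = defaultdict(int)    # how often each edge carried information
participation = defaultdict(int)  # how often each node forwarded information

for _ in range(200):
    source = rng.choice(list(adj))
    for u, v in propagate(adj, source, p=0.5, rng=rng):
        edge_weight[tuple(sorted((u, v)))] += 1
        participation[u] += 1

print(dict(edge_weight))
print(dict(participation))
```

The resulting edge weights could then be passed to any weighted community-detection method to compare the community division before and after propagation.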

Citations: 0
A Fast Survival Support Vector Regression Approach to Large Scale Credit Scoring via Safe Screening.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-07-23 DOI: 10.1089/big.2023.0033
Hong Wang, Ling Hong

Survival models have recently found increasingly wide application in credit scoring because they can estimate the dynamics of risk over time. In this research, we propose a Buckley-James safe sample screening support vector regression (BJS4VR) algorithm that models large-scale survival data by combining the Buckley-James transformation with support vector regression. Unlike previous support vector regression survival models, censored samples are imputed here using a censoring-unbiased Buckley-James estimator. Safe sample screening is then applied to discard from the original data those samples that are guaranteed to be non-active at the final optimal solution, which improves efficiency. Experimental results on large-scale real Lending Club loan data show that the proposed BJS4VR model outperforms popular existing survival models such as RSFM, CoxRidge, and CoxBoost in both prediction accuracy and time efficiency. The proposed method also identifies important variables that are highly correlated with credit risk.
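The imputation idea behind the Buckley-James transformation can be sketched in a stripped-down form: each censored time is replaced by its conditional expectation beyond the censoring point, computed from Kaplan-Meier probability masses. This is a simplification and not BJS4VR: there are no covariates, no regression residuals, and no iteration, the largest observation is treated as an event (a common convention), ties at the largest time are assumed absent, and the function names and toy data are hypothetical.

```python
def km_masses(times, events):
    """Kaplan-Meier probability mass at each distinct event time.

    The largest observation is treated as an event so the masses sum to 1.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    t_sorted = [times[i] for i in order]
    e_sorted = [events[i] for i in order]
    e_sorted[-1] = 1

    n = len(t_sorted)
    surv = 1.0
    masses = {}
    i = 0
    while i < n:
        t = t_sorted[i]
        j = i
        deaths = 0
        while j < n and t_sorted[j] == t:
            deaths += e_sorted[j]
            j += 1
        at_risk = n - i
        if deaths:
            new_surv = surv * (1 - deaths / at_risk)
            masses[t] = surv - new_surv
            surv = new_surv
        i = j
    return masses


def impute_censored(times, events):
    """Replace each censored time c by E[T | T > c] under the KM distribution."""
    masses = km_masses(times, events)
    imputed = []
    for t, e in zip(times, events):
        if e:
            imputed.append(float(t))
            continue
        tail = {s: m for s, m in masses.items() if s > t}
        total = sum(tail.values())
        imputed.append(sum(s * m for s, m in tail.items()) / total if total > 0 else float(t))
    return imputed


# Toy data: event = 1 means the event (e.g., default) was observed, 0 means censored.
times = [1, 2, 3, 4]
events = [1, 0, 1, 1]
imputed = impute_censored(times, events)
print(imputed)
```

Once every sample has a numeric target, a standard regression learner such as support vector regression can be trained on the imputed values.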

Citations: 0
Content-Aware Human Mobility Pattern Extraction.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-07-10 DOI: 10.1089/big.2022.0281
Shengwen Li, Chaofan Fan, Tianci Li, Renyao Chen, Qingyuan Liu, Junfang Gong

Extracting meaningful patterns of human mobility from accumulating trajectories is essential for understanding human behavior. However, previous work identifies human mobility patterns from the spatial co-occurrence of trajectories, which ignores the effect of activity content and makes patterns hard to extract and understand effectively. To bridge this gap, this study incorporates the activity content of trajectories into the extraction of human mobility patterns and proposes a content-aware mobility pattern model. The model first embeds the activity content in a distributed continuous vector space by taking the point of interest as an agent, and then extracts representative and interpretable mobility patterns from sets of human trajectories using a derived topic model. To investigate the performance of the proposed model, several evaluation metrics are developed, including pattern coherence, pattern similarity, and manual scoring. A real-world case study is conducted, and its experimental results show that the proposed model improves interpretability and helps in understanding mobility patterns. This study provides not only a novel solution and several evaluation metrics for human mobility patterns but also a methodological reference for fusing the content semantics of human activities in trajectory analysis and mining.
Citations: 0
A Basketball Big Data Platform for Box Score and Play-by-Play Data.
IF 4.6 Zone 4 Computer Science Q1 Computer Science Pub Date: 2024-04-12 DOI: 10.1089/big.2023.0177
G. Vinué
This is the second part of a research diptych devoted to improving basketball data management in Spain. The Spanish ACB (Association of Basketball Clubs, acronym in Spanish) is the top European national competition. It attracts most of the best foreign players outside the NBA (National Basketball Association, in North America) and also accelerates the development of Spanish players who ultimately contribute to the success of the Spanish national team. However, this sporting excellence is not reciprocated by an advanced treatment of the data generated by teams and players, the so-called statistics. On the contrary, their use is still very rudimentary. An earlier article published in this journal in 2020 introduced the first open web application for interactive visualization of the box score data from three European competitions, including the ACB. Box score data refer to the data provided once the game is finished. Following the same inspiration, this new research aims to present the work carried out with more advanced data, namely, play-by-play data, which are provided as the game runs. This type of data allows us to gain greater insight into basketball performance, providing information that cannot be revealed with box score data. A new dashboard is developed to analyze play-by-play data from a number of different and novel perspectives. Furthermore, a comprehensive data platform encompassing the visualization of the ACB box score and play-by-play data is presented.
Citations: 0