
Big data analytics: latest articles

Cyberpsychology: A Longitudinal Analysis of Cyber Adversarial Tactics and Techniques
Pub Date : 2023-08-11 DOI: 10.3390/analytics2030035
Marshall S. Rich
The rapid proliferation of cyberthreats necessitates a robust understanding of their evolution and associated tactics, as found in this study. A longitudinal analysis of these threats was conducted, utilizing a six-year data set obtained from a deception network, which emphasized its significance in the study’s primary aim: the exhaustive exploration of the tactics and strategies utilized by cybercriminals and how these tactics and techniques evolved in sophistication and target specificity over time. Different cyberattack instances were dissected and interpreted, with the patterns behind target selection shown. The focus was on unveiling patterns behind target selection and highlighting recurring techniques and emerging trends. The study’s methodological design incorporated data preprocessing, exploratory data analysis, clustering and anomaly detection, temporal analysis, and cross-referencing. The validation process underscored the reliability and robustness of the findings, providing evidence of increasingly sophisticated, targeted cyberattacks. The work identified three distinct network traffic behavior clusters and temporal attack patterns. A validated scoring mechanism provided a benchmark for network anomalies, applicable for predictive analysis and facilitating comparative study of network behaviors. This benchmarking aids organizations in proactively identifying and responding to potential threats. The study significantly contributed to the cybersecurity discourse, offering insights that could guide the development of more effective defense strategies. The need for further investigation into the nature of detected anomalies was acknowledged, advocating for continuous research and proactive defense strategies in the face of the constantly evolving landscape of cyberthreats.
Citations: 0
Prediction of Stroke Disease with Demographic and Behavioural Data Using Random Forest Algorithm
Pub Date : 2023-08-02 DOI: 10.3390/analytics2030034
O. Shobayo, Oluwafemi Zachariah, M. Odusami, Bayode Ogunleye
Stroke is a major cause of death worldwide, resulting from a blockage in the flow of blood to different parts of the brain. Many studies have proposed a stroke disease prediction model using medical features applied to deep learning (DL) algorithms to reduce its occurrence. However, these studies pay less attention to the predictors (both demographic and behavioural). Our study considers interpretability, robustness, and generalisation as key themes for deploying algorithms in the medical domain. Based on this background, we propose the use of random forest for stroke incidence prediction. Results from our experiment showed that random forest (RF) outperformed decision tree (DT) and logistic regression (LR) with a macro F1 score of 94%. Our findings indicated age and body mass index (BMI) as the most significant predictors of stroke disease incidence.
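The macro F1 score quoted above is the unweighted mean of the per-class F1 scores, so a rare class such as stroke cases counts as much as the common one. The following is a minimal sketch of that metric in plain Python; the function name and the toy labels are illustrative assumptions, not data or code from the study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced labels: 0 = no stroke, 1 = stroke.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(macro_f1(y_true, y_pred))  # 0.625, lower than the plain accuracy of 4/6
```

On imbalanced data the macro average penalizes a model that does well only on the majority class, which is presumably why it is reported instead of accuracy.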
Citations: 0
Identification of Patterns in the Stock Market through Unsupervised Algorithms
Pub Date : 2023-07-27 DOI: 10.3390/analytics2030033
Adrian Barradas, R. Cantón-Croda, D. Gibaja-Romero
Making predictions in the stock market is a challenging task. At the same time, several studies have focused on forecasting the future behavior of the market and classifying financial assets. A different approach is to classify correlated data to discover patterns and atypical behaviors in them. In this study, we propose applying unsupervised algorithms to process, model, and cluster related data from two different data sources, i.e., Google News and Yahoo Finance, to identify conditions in the stock market that might help to support the investment decision-making process. We applied principal component analysis (PCA) and a k-means clustering approach to group data according to their principal characteristics. We identified four conditions in the stock market, one comprising the least amount of data, characterized by high volatility. The main results show that, regularly, the stock market tends to have a steady performance. However, atypical conditions are conducive to higher volatility.
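The clustering step described above can be sketched with plain Lloyd's k-means. This is a minimal illustration, not the authors' pipeline: it skips the PCA step, seeds the centroids with the first k rows so the result is deterministic, and the toy (return, volatility) rows are hypothetical.

```python
import math


def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical rows of (daily return, volatility); two of them are high-volatility days.
rows = [(0.1, 1.0), (5.0, 9.0), (0.2, 1.1), (0.0, 0.9), (5.2, 9.1)]
labels, centroids = kmeans(rows, k=2)
print(labels)  # [0, 1, 0, 0, 1]
```

In this toy data the smaller cluster is the high-volatility one, which mirrors the kind of atypical market condition the abstract describes.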
Citations: 0
Streamflow Estimation through Coupling of Hierarchical Clustering Analysis and Regression Analysis—A Case Study in Euphrates-Tigris Basin
Pub Date : 2023-07-13 DOI: 10.3390/analytics2030032
Goksel Ezgi Guzey, Bihrat Onoz
This study investigated the resilience of designed water systems in the face of limited streamflow gauging stations and escalating global warming impacts. A regression analysis correlated simulated meteorological data with observed streamflow from 1971 to 2020 across 33 stream gauging stations in the Euphrates-Tigris Basin. Using the Ordinary Least Squares regression method, streamflow for 2020–2100 was also predicted from simulated meteorological data under the RCP 4.5 and RCP 8.5 scenarios in the CORDEX-EURO and CORDEX-MENA domains. Streamflow variability was calculated based on meteorological variables and station morphological characteristics, particularly evapotranspiration. Hierarchical clustering analysis identified two clusters among the stream gauging stations, and for each cluster, two streamflow equations were derived. The regression analysis achieved robust streamflow predictions using six representative climate variables, with adjusted R² values of 0.7–0.85 across all models, primarily influenced by evapotranspiration. The use of a global model led to a 10% decrease in prediction capabilities for all CORDEX models based on R² performance. This study emphasizes the importance of region homogeneity in estimating streamflow, encompassing both geographical and hydro-meteorological characteristics.
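The adjusted R² reported above corrects the ordinary R² for the number of predictors in the regression. A minimal sketch of the standard formula follows; the six predictors come from the abstract, while the R² of 0.8 and the sample size of 50 are assumed values chosen only for illustration.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof


# Assumed inputs: R^2 = 0.8 from a fit with 6 climate variables on 50 observations.
print(round(adjusted_r2(0.8, 50, 6), 3))  # 0.772
```

The penalty grows as predictors are added without improving the fit, so the adjusted value is a fairer basis for comparing models that use different numbers of climate variables.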
Citations: 0
Hierarchical Model-Based Deep Reinforcement Learning for Single-Asset Trading
Pub Date : 2023-07-11 DOI: 10.3390/analytics2030031
Adrian Millea
We present a hierarchical reinforcement learning (RL) architecture that employs various low-level agents to act in the trading environment, i.e., the market. The highest-level agent selects from among a group of specialized agents, and then the selected agent decides when to sell or buy a single asset for a period of time. This period can be variable according to a termination function. We hypothesized that, due to different market regimes, more than one single agent is needed when trying to learn from such heterogeneous data, and instead, multiple agents will perform better, with each one specializing in a subset of the data. We use k-means clustering to partition the data and train each agent with a different cluster. Partitioning the input data also helps model-based RL (MBRL), where models can be heterogeneous. We also add two simple decision-making models to the set of low-level agents, diversifying the pool of available agents, and thus increasing overall behavioral flexibility. We perform multiple experiments showing the strengths of a hierarchical approach and test various prediction models at both levels. We also use a risk-based reward at the high level, which transforms the overall problem into a risk-return optimization. This type of reward shows a significant reduction in risk while minimally reducing profits. Overall, the hierarchical approach shows significant promise, especially when the pool of low-level agents is highly diverse. The usefulness of such a system is clear, especially for human-devised strategies, which could be incorporated in a sound manner into larger, powerful automatic systems.
Citations: 0
occams: A Text Summarization Package
Pub Date : 2023-06-30 DOI: 10.3390/analytics2030030
Clinton T. White, Neil P. Molino, Julia S. Yang, John M. Conroy
Extractive text summarization selects a small subset of sentences from a document, which gives good “coverage” of a document. When given a set of term weights indicating the importance of the terms, the concept of coverage may be formalized into a combinatorial optimization problem known as the budgeted maximum coverage problem. Extractive methods in this class are known to be among the best of classic extractive summarization systems. This paper gives a synopsis of the software package occams, which is a multilingual extractive single and multi-document summarization package based on an algorithm giving an optimal approximation to the budgeted maximum coverage problem. The occams package is written in Python and provides an easy-to-use modular interface, allowing it to work in conjunction with popular Python NLP packages, such as nltk, stanza or spacy.
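To make the budgeted maximum coverage formulation concrete, here is a minimal greedy sketch: each sentence covers a set of weighted terms, costs its length in tokens, and the summary must fit a length budget. This is not the occams algorithm or its API; the function name, the toy sentences, and the weights are all hypothetical, and a plain gain-per-cost greedy like this one does not by itself carry the best-known approximation guarantee.

```python
def greedy_budgeted_coverage(sentences, weights, budget):
    """Pick sentence indices by best new-term weight per token, within a token budget."""
    chosen, covered, spent = [], set(), 0
    remaining = set(range(len(sentences)))
    while remaining:
        best, best_ratio = None, 0.0
        for i in sorted(remaining):
            cost = len(sentences[i])
            if cost == 0 or spent + cost > budget:
                continue  # empty sentence, or it would exceed the budget
            # Only terms not yet covered contribute to the gain.
            gain = sum(weights.get(t, 0.0) for t in set(sentences[i]) - covered)
            if gain / cost > best_ratio:
                best, best_ratio = i, gain / cost
        if best is None:
            break  # nothing affordable adds any weight
        chosen.append(best)
        covered |= set(sentences[best])
        spent += len(sentences[best])
        remaining.discard(best)
    return chosen


# Hypothetical tokenized sentences and term weights.
sentences = [
    ["cats", "chase", "mice"],
    ["dogs", "chase", "cats"],
    ["mice", "eat", "cheese"],
]
weights = {"cats": 3.0, "mice": 2.5, "dogs": 2.0, "cheese": 1.0, "chase": 1.0, "eat": 0.5}
print(greedy_budgeted_coverage(sentences, weights, budget=6))  # [0, 1]
```

After the first sentence is chosen, "cats" and "chase" no longer count toward the second sentence's gain, which is how the coverage objective discourages redundant sentences.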
Citations: 1
Bayesian Mixture Copula Estimation and Selection with Applications
Pub Date : 2023-06-15 DOI: 10.3390/analytics2020029
Yujian Liu, Dejun Xie, Siyi Yu
Mixture copulas are popular and essential tools for studying complex dependencies among variables. However, selecting the correct mixture models often involves repeated testing and estimations using criteria such as AIC, which could require effort and time. In this paper, we propose a method that would enable us to select and estimate the correct mixture copulas simultaneously. This is accomplished by first overfitting the model and then conducting the Bayesian estimations. We verify the correctness of our approach by numerical simulations. Finally, the real data analysis is performed by studying the dependencies among three major financial markets.
Citations: 3
Preliminary Perspectives on Information Passing in the Intelligence Community
Pub Date : 2023-06-15 DOI: 10.3390/analytics2020028
Jeremy E. Block, Ilana Bookner, S. Chu, R. J. Crouser, Donald R. Honeycutt, Rebecca M. Jonas, Abhishek Kulkarni, Yancy Vance M. Paredes, E. Ragan
Analyst sensemaking research typically focuses on individual or small groups conducting intelligence tasks. This has helped understand information retrieval tasks and how people communicate information. As a part of the grand challenge of the Summer Conference on Applied Data Science (SCADS) to build a system that can generate tailored daily reports (TLDR) for intelligence analysts, we conducted a qualitative interview study with analysts to increase understanding of information passing in the intelligence community. While our results are preliminary, we expect that this work will contribute to a better understanding of the information ecosystem of the intelligence community, how institutional dynamics affect information passing, and what implications this has for a TLDR system. This work describes our involvement in and work completed during SCADS. Although preliminary, we identify that information passing is both a formal and informal process and often follows professional networks due especially to the small population and specialization of work. We call attention to the need for future analysis of information ecosystems to better support tailored information retrieval features.
Citations: 0
Spatiotemporal Data Mining Problems and Methods
Pub Date : 2023-06-14 DOI: 10.3390/analytics2020027
Eleftheria Koutsaki, George Vardakis, N. Papadakis
Many scientific fields show great interest in the extraction and processing of spatiotemporal data, including medicine (with an emphasis on epidemiology and neurology), geology, the social sciences, meteorology, and the study of transport. Spatiotemporal data differ significantly from spatial data: spatiotemporal data refer to measurements that take into account both the place and the time at which they are received, with their respective characteristics, while spatial data describe information related only to place. The innovation brought about by spatiotemporal data mining has caused a revolution in many scientific fields, because through it we can now provide solutions and answers to complex problems, as well as useful and valuable predictions, through predictive learning. However, combining time and place in data mining presents significant challenges and difficulties that must be overcome. Spatiotemporal data mining and analysis is a relatively new approach to data mining that has been studied more systematically in the last decade. The purpose of this article is to provide a good introduction to spatiotemporal data, and through this detailed description, we attempt to introduce descriptive logic and gain a complete knowledge of these data. We aim to introduce a new way of describing them, with future studies in mind, by combining the expressions that arise by type of data, using descriptive logic, with new expressions that can be derived, to describe future states of objects and environments with great precision, providing accurate predictions. In order to highlight the value of spatiotemporal data, we proceed to give a brief description of ST data in the introduction. We describe the relevant work carried out to date, the types of spatiotemporal (ST) data, their properties and the transformations that can be made between them, attempting, to a small extent, to introduce constraints and rules using descriptive logic by type when initially presenting the ST data. The data snapshots by species and similarities between the cases are then described. We describe methods, introducing clustering, dynamic ST clusters, predictive learning, frequent pattern mining, and pattern emergence, and problems such as anomaly detection, identifying time points of changes in the behavior of the observed object, and development of relationships between them. We describe the application of ST data in various fields today, as well as the future work. We conclude that the representation and study of spatiotemporal data can, in combination with other properties which accompany all natural phenomena, through their appropriate processing, lead to safe conclusions regarding the study of problems, and to great precision in the extraction of predictions by accurately determining future states of an environment or object. The importance of ST data therefore makes them particularly valuable today in various scientific fields, and their extraction is a particularly demanding challenge for the future.
Citations: 0
A Novel Zero-Truncated Katz Distribution by the Lagrange Expansion of the Second Kind with Associated Inferences
Pub Date : 2023-06-01 DOI: 10.3390/analytics2020026
D. S. Shibu, C. Chesneau, M. Monisha, R. Maya, M. Irshad
In this article, the Lagrange expansion of the second kind is used to generate a novel zero-truncated Katz distribution, referred to as the Lagrangian zero-truncated Katz distribution (LZTKD). Notably, the zero-truncated Katz distribution is a special case of this distribution. Along with closed-form expressions for all its statistical characteristics, the LZTKD is shown to provide an adequate model for both underdispersed and overdispersed zero-truncated count datasets. Specifically, we show that the associated hazard rate function can have increasing, decreasing, bathtub, or upside-down bathtub shapes. Moreover, we demonstrate that the LZTKD belongs to the Lagrangian distributions of the first kind. Applications of the LZTKD in statistical scenarios are then explored. The unknown parameters are estimated by the method of maximum likelihood, and the generalized likelihood ratio test is applied to test the significance of the additional parameter. Simulation studies are also conducted to evaluate the performance of the maximum likelihood estimates. The use of real-life datasets further highlights the relevance and applicability of the proposed model.
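To make the idea of zero truncation concrete, the sketch below uses the classical Katz family, not the LZTKD itself, whose probability mass function is not given in this abstract. It assumes the standard Katz recurrence p(x+1)/p(x) = (a + b·x)/(x + 1) and conditions on X > 0. The function names `katz_pmf`, `zero_truncate`, and `mean_var` are hypothetical and chosen only for illustration.

```python
def katz_pmf(a, b, n_terms=200):
    """Classical Katz pmf on 0, 1, 2, ... built from the recurrence
    p(x+1) = p(x) * (a + b*x) / (x + 1), then normalised numerically.
    Assumes a > 0 and b < 1; n_terms caps the support (an approximation
    when the support is infinite)."""
    weights = [1.0]
    for x in range(n_terms - 1):
        ratio = (a + b * x) / (x + 1)
        if ratio <= 0:  # support ends here (binomial-like case, b < 0)
            break
        weights.append(weights[-1] * ratio)
    total = sum(weights)
    return [w / total for w in weights]


def zero_truncate(pmf):
    """Condition on X > 0: q(x) = p(x) / (1 - p(0)) for x = 1, 2, ..."""
    tail = 1.0 - pmf[0]
    return {x: p / tail for x, p in enumerate(pmf) if x > 0}


def mean_var(dist):
    """Mean and variance of a distribution given as {value: probability}."""
    m = sum(x * p for x, p in dist.items())
    v = sum((x - m) ** 2 * p for x, p in dist.items())
    return m, v


if __name__ == "__main__":
    # b < 0, b = 0, b > 0 give binomial-, Poisson- and negative-binomial-like shapes.
    for a, b in [(3.0, -1.0), (2.0, 0.0), (2.0, 0.5)]:
        truncated = zero_truncate(katz_pmf(a, b))
        print(a, b, mean_var(truncated))
```

Before truncation, the Katz family has mean a/(1 − b) and variance a/(1 − b)², so the sign of b decides whether the counts are underdispersed (b < 0) or overdispersed (b > 0). Truncation at zero changes both moments, which is why the truncated case needs its own treatment.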
Citations: 0