Risk analytics is an integral component in assessing the risk profile of potential and existing obligors. For example, creditworthiness is often assessed with scorecards, which are regulatory credit risk models developed from historical data and domain expertise in banks and financial institutions. A purely statistical model, however, often fails to satisfy regulatory requirements on predictiveness and interpretability at the same time. Instead, practical risk models incorporate expert opinions within the development process, such as forcing the direction of travel for certain financial factors. In this article, the author proposes a unified framework, termed the constrained and partially regularized logistic regression (CPR-LR) model, for embedding human inputs in the statistical estimation procedure when developing credit risk models. By expressing such inputs as model constraints at different levels, the proposed approach serves as an effective solution for developing intuitive, easy-to-interpret, and statistically robust credit risk models, as demonstrated in the author’s experiments. This work also contributes to the growing field of human-in-the-loop model development: the author shows that domain expertise can be formulated as model constraints, thus biasing the resulting statistical model to be more interpretable and regulation compliant.
An Integrated Framework on Human-in-the-Loop Risk Analytics. Peng Liu. The Journal of Financial Data Science. Pub Date: 2022-12-15. DOI: 10.3905/jfds.2022.1.116 (https://doi.org/10.3905/jfds.2022.1.116)
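The idea of expressing expert input as constraints in a logistic regression can be sketched as follows. This is a minimal illustration, not the author's CPR-LR estimator: it assumes that "direction of travel" means a sign constraint on a coefficient, that "partially regularized" means an L2 penalty applied to only some coefficients, and it uses plain projected gradient descent. The function name, the toy data, and all parameter values are hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_constrained_logit(X, y, signs, penalized, lam=0.1, lr=0.1, steps=2000):
    """Logistic regression fitted by projected gradient descent.

    signs[j]     : +1 forces coefficient j >= 0, -1 forces it <= 0, 0 leaves it free.
    penalized[j] : True applies an L2 penalty to coefficient j only (assumed meaning
                   of "partial" regularization).
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        for j in range(p):
            if penalized[j]:
                grad_w[j] += lam * w[j]
            w[j] -= lr * grad_w[j]
            # Projection step: clip the coefficient back into the expert-allowed region.
            if signs[j] > 0:
                w[j] = max(w[j], 0.0)
            elif signs[j] < 0:
                w[j] = min(w[j], 0.0)
        b -= lr * grad_b
    return w, b

# Made-up toy data: both columns rise with the default flag, but the (hypothetical)
# expert insists that the second factor must not increase predicted risk.
X = [[1.0, 0.2], [2.0, 0.1], [3.0, 0.9], [4.0, 0.8], [1.5, 0.3], [3.5, 0.7]]
y = [0, 0, 1, 1, 0, 1]
w, b = fit_constrained_logit(X, y, signs=[+1, -1], penalized=[False, True])
print(w, b)
```

In this toy run the data alone would push the second coefficient upward; the projection keeps it at or below zero, which is the kind of interpretability-driven bias the abstract describes.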
This study systematically investigates different ensemble methods for meta-labeling in finance and presents a framework to facilitate the selection of ensemble learning models for this purpose. Experiments were conducted on the components of information advantage and modeling for false positives to discover whether ensembles were better at extracting and detecting regimes and whether they increased model efficiency. The authors demonstrate that ensembles are especially beneficial when the underlying data consist of multiple regimes and are nonlinear in nature. The authors’ framework serves as a starting point for further research. They suggest that the use of different fusion strategies may foster model selection. Finally, the authors elaborate on how additional applications, such as position sizing, may benefit from their framework.
Ensemble Meta-Labeling. Dennis Thumm, P. Barucca, J. Joubert. The Journal of Financial Data Science. Pub Date: 2022-12-14. DOI: 10.3905/jfds.2022.1.114 (https://doi.org/10.3905/jfds.2022.1.114)
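To make the terms in the abstract above concrete, here is a small sketch of how meta-labels and two simple fusion strategies might look. It assumes the usual meta-labeling setup (a primary model proposes a trade side, a secondary model predicts whether that trade will be profitable) and uses soft voting and majority voting as example fusion rules; the article may study different ones. All function names and numbers are hypothetical.

```python
def meta_labels(primary_side, realized_return):
    """Meta-label is 1 when the primary model's side made money, else 0."""
    return [1 if s * r > 0 else 0 for s, r in zip(primary_side, realized_return)]

def fuse_mean(prob_lists):
    """Soft voting: average each observation's probability across ensemble members."""
    n_members = len(prob_lists)
    return [sum(ps) / n_members for ps in zip(*prob_lists)]

def fuse_majority(prob_lists, threshold=0.5):
    """Hard voting: each member votes, the strict majority decides."""
    votes = [[1 if p >= threshold else 0 for p in ps] for ps in prob_lists]
    n_members = len(votes)
    return [1 if 2 * sum(v) > n_members else 0 for v in zip(*votes)]

# Made-up example: four trades from a primary model and their realized returns.
labels = meta_labels([1, -1, 1, -1], [0.02, 0.01, -0.03, -0.02])

# Made-up probabilities from three secondary models for two new trades.
member_probs = [[0.9, 0.2], [0.7, 0.4], [0.2, 0.9]]
print(labels, fuse_mean(member_probs), fuse_majority(member_probs))
```

The two fusion rules can disagree when one member is very confident, which is one reason the choice of fusion strategy matters for model selection.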
Over the past decade, there has been a surging trend to integrate environmental, social, and governance (ESG) criteria into financial decision making. ESG information extracted manually from text sources, such as company statements, press releases, and regulatory disclosures, can be expensive and inconsistent due to human interpretation. In this article, the authors introduce the application of prompt-based learning, a cutting-edge natural language processing (NLP) technology, to classify textual data into ESG and non-ESG categories. In particular, the authors establish a prompt-based ESG classifier, using data from Refinitiv, and benchmark it against a traditional pre-train and fine-tune classifier through statistical tests. The authors fine-tune the classifiers on training sets of various sizes. The experiment shows that the prompt-based learning approach outperforms the traditional pre-train and fine-tune classifier and can generate promising results when training data are limited.
ESG Text Classification: An Application of the Prompt-Based Learning Approach. Zhengzheng Yang, Le Zhang, Xiaoyu Wang, Yubo Mai. The Journal of Financial Data Science. Pub Date: 2022-12-14. DOI: 10.3905/jfds.2022.1.115 (https://doi.org/10.3905/jfds.2022.1.115)
In this article, the authors show how to build a real-time geopolitical risk index from news data using textual analysis. The presented method defines a point-in-time dictionary of terms related to political tension. It does not rely on the in-sample definition of a set of n-grams that are likely chosen and updated with hindsight bias. The proposed model can be applied to any topic and is language agnostic. Only a few topic-related words are required to initialize the buildup of a dynamically self-adjusting dictionary. The authors show that their approach can produce results resembling those of other, more supervised methods. The findings indicate how topic identification and news index construction may benefit from time-dependent dictionary generation.
Point-in-Time Language Model for Geopolitical Risk Events. Matthias Apel, A. Betzer, B. Scherer. The Journal of Financial Data Science. Pub Date: 2022-12-14. DOI: 10.3905/jfds.2022.1.113 (https://doi.org/10.3905/jfds.2022.1.113)
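The point-in-time principle described above can be illustrated with a toy loop: each batch of news is scored with the dictionary as it stood before that batch arrived, and only afterwards is the dictionary allowed to grow. This is a sketch under strong assumptions. Simple co-occurrence counting stands in for whatever language model the authors actually use, and the function name, thresholds, and headlines are all hypothetical.

```python
from collections import Counter

def point_in_time_index(batches, seeds, top_k=5, min_count=2):
    """batches: list of (date, [documents]) in chronological order.

    Returns ([(date, index_value)], final_dictionary). The index is the share of
    tokens that belong to the dictionary known *before* the batch was seen.
    """
    dictionary = set(seeds)
    index = []
    for date, docs in batches:
        tokenized = [doc.lower().split() for doc in docs]
        # 1) Score with the dictionary available so far (no look-ahead).
        total = sum(len(toks) for toks in tokenized)
        hits = sum(1 for toks in tokenized for w in toks if w in dictionary)
        index.append((date, hits / total if total else 0.0))
        # 2) Only now update the dictionary, using words that co-occur with known terms.
        cooc = Counter()
        for toks in tokenized:
            if dictionary.intersection(toks):
                cooc.update(w for w in set(toks) if w not in dictionary)
        for word, count in cooc.most_common(top_k):
            if count >= min_count:
                dictionary.add(word)
    return index, dictionary

# Made-up headlines; a single seed word initializes the dictionary.
batches = [
    ("d1", ["war sanctions announced", "war sanctions escalate", "markets calm"]),
    ("d2", ["sanctions hit exports", "markets calm"]),
]
index, dictionary = point_in_time_index(batches, seeds={"war"})
print(index, sorted(dictionary))
```

In the toy data the second batch never mentions the seed word, yet it still registers on the index because "sanctions" was learned from the first batch, without any information from the future.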
The authors describe a new prediction system based on relevance, which gives a mathematically precise measure of the importance of an observation to forming a prediction, as well as fit, which measures a specific prediction’s reliability. They show how their relevance-based approach to prediction identifies the optimal combination of observations and predictive variables for any given prediction task, thereby presenting a unified alternative to both kernel regression and lasso regression, which they call CKT regression. They argue that their new prediction system addresses complexities that are beyond the capacity of linear regression analysis but in a way that is more transparent, more flexible, and less arbitrary than widely used machine learning algorithms.
Relevance-Based Prediction: A Transparent and Adaptive Alternative to Machine Learning. M. Czasonis, M. Kritzman, D. Turkington. The Journal of Financial Data Science. Pub Date: 2022-12-01. DOI: 10.3905/jfds.2022.1.110 (https://doi.org/10.3905/jfds.2022.1.110)
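A short numerical sketch may help make "relevance" tangible. It assumes the Mahalanobis-based definition, in which the relevance of observation i to the current circumstances t is similarity plus the average informativeness of the two points, and it covers only the full-sample case; the partial-sample weighting, the fit measure, and the variable-selection steps behind CKT regression are not reproduced. Function names and the simulated data are hypothetical.

```python
import numpy as np

def relevance(X, x_t):
    """r_i = sim(x_i, x_t) + 0.5 * (info(x_i) + info(x_t)), all in Mahalanobis terms."""
    mean = X.mean(axis=0)
    omega_inv = np.linalg.inv(np.cov(X, rowvar=False))  # assumes at least 2 predictors

    def maha(a, b):
        d = a - b
        return np.einsum("...i,ij,...j->...", d, omega_inv, d)

    sim = -0.5 * maha(X, x_t)      # closeness of each observation to x_t
    info = maha(X, mean)           # unusualness of each observation
    info_t = maha(x_t, mean)       # unusualness of the current circumstances
    return sim + 0.5 * (info + info_t)

def relevance_prediction(X, y, x_t):
    """Relevance-weighted average of past outcomes around their mean."""
    r = relevance(X, x_t)
    return y.mean() + (r @ (y - y.mean())) / (len(y) - 1)

# Simulated data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
x_t = np.array([0.3, -0.1, 0.8])

# Ordinary least squares with an intercept, for comparison.
A = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.lstsq(A, y, rcond=None)[0]
ols_prediction = beta[0] + x_t @ beta[1:]

print(relevance_prediction(X, y, x_t), ols_prediction)
```

When every observation is used, the relevance-weighted prediction coincides with the linear regression prediction; the extra flexibility comes from restricting attention to subsets of observations and variables.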
In this work, the author proposes a modified version of PHATE, a diffusion map-based embedding algorithm, that is tuned primarily for financial time-series data. The new algorithm, financial affinity-based diffusion transition embedding (FATE), takes in user-specified distance metrics that make sense for time-series data and uses symmetrized f-divergences applied to the diffusion probabilities as the final embedding distance before passing them into a metric multidimensional scaling step. The proposed visualization method reveals both local and global structures of the input time-series dataset. Performance of this visualization algorithm is first demonstrated through numerical experiments with Dow Jones 30 stock returns and S&P 100 stock returns. The author compares FATE visualization results using correlation-type distances with t-stochastic neighbor embedding and PHATE embeddings, among others, to demonstrate the advantages and new perspectives of FATE both qualitatively and quantitatively. In addition, the author provides experiments on synthetic ARMA time series with fine control of the structure of the underlying model parameters. The results demonstrate the ability of transfer function information distance and time-lagged Hellinger distance to identify structures within the generating time-series models from their time-series realizations alone, which cannot be identified by correlation-type distances or Euclidean distances. The author concludes that the choice of distance metrics has an important role in the kind of structure one can uncover from time-series datasets.
Visualizing Structures in Financial Time-Series Datasets through Affinity-Based Diffusion Transition Embedding. Rui Ding. The Journal of Financial Data Science. Pub Date: 2022-12-01. DOI: 10.3905/jfds.2022.1.111 (https://doi.org/10.3905/jfds.2022.1.111)
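The pipeline named in the abstract (distance matrix, diffusion probabilities, symmetrized divergence, multidimensional scaling) can be sketched roughly as below. This is not the FATE algorithm itself: it assumes a plain Gaussian kernel, uses the symmetrized Kullback-Leibler (Jeffreys) divergence as one example of a symmetrized f-divergence, treats that divergence as a squared dissimilarity, and swaps in classical MDS for the metric MDS step. The function name, parameters, and simulated returns are hypothetical.

```python
import numpy as np

def fate_like_embedding(D, t=3, sigma=1.0, dim=2, eps=1e-12):
    """D: (n, n) distance matrix between time series. Returns (embedding, divergence)."""
    K = np.exp(-(D / sigma) ** 2)                  # Gaussian affinities
    P = K / K.sum(axis=1, keepdims=True)           # row-stochastic diffusion operator
    Pt = np.linalg.matrix_power(P, t)              # t-step diffusion probabilities
    logPt = np.log(Pt + eps)
    # Jeffreys divergence between rows: sum_k (p_ik - p_jk) * (log p_ik - log p_jk)
    J = ((Pt[:, None, :] - Pt[None, :, :]) * (logPt[:, None, :] - logPt[None, :, :])).sum(axis=2)
    # Classical MDS with J used as a squared dissimilarity (an assumption of this sketch).
    n = J.shape[0]
    C = np.eye(n) - np.full((n, n), 1.0 / n)
    B = -0.5 * C @ J @ C
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:dim]
    Y = vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))
    return Y, J

# Simulated returns: series 0-2 follow one factor, series 3-5 another.
rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 2))
loadings = np.repeat(np.eye(2), 3, axis=0)
R = factors @ loadings.T + 0.1 * rng.normal(size=(500, 6))

# Correlation-type distance, one of the user-specified metrics the abstract mentions.
corr = np.corrcoef(R, rowvar=False)
D = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))

Y, J = fate_like_embedding(D)
print(Y.round(3))
```

Swapping `D` for another distance between time series changes which structure the embedding can reveal, which is the point the abstract closes on.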
Pub Date: 2022-10-31 DOI: 10.3905/jfds.2022.4.4.025
Han-Tai Shiao, Cynthia A. Pagliaro, D. Mehta
During periods of extreme market volatility, such as that experienced during the COVID-19 pandemic, advised investors may consider impulsive and inappropriate investment decisions like moving all assets to cash. Financial advisors, through proactive behavioral coaching, can help their clients avoid such decisions. But which clients need the most help? A predictive model that better identifies the clients most likely to react to market volatility can be an invaluable tool for financial advisors. Such a model requires insight into the investors’ mindset. In previous work, the authors focused on the perspective of the financial advisor and used natural language processing to explore advisors’ summary notes to extract such investor insights. They then used this novel data source as input for a machine-learning model to predict the investors most in need of intervention during volatile market periods. In this article, the authors further expand the model to include a unique dataset of investors’ digital activity, including investor-initiated contacts (via web, email, and phone) and web activity (page view and browsing history), to better reveal investor intention. Using machine-learning techniques, the authors build a model using this novel dataset as well as advisor notes, transaction activity, and a market volatility index to identify advised investors most in need of proactive intervention. The authors further describe the implication such work has for both traditional and robo-advisory service models.
Using Machine Learning to Model Advised-Investor Behavior. The Journal of Financial Data Science. https://doi.org/10.3905/jfds.2022.4.4.025
Fernando de Meer Pardo, Peter Schwendner, Marcus Wunsch
Generative adversarial networks (GANs) have been shown to be able to generate samples of complex financial time series, particularly by employing the concept of path signatures, a universal description of the geometric properties of a data stream whose expected value uniquely characterizes the time series. Specifically, the SigCWGAN model (Ni et al. 2020) can generate time series of arbitrary length; however, the parameters of the neural network employed grow exponentially with the dimension of the underlying time series, which makes the model intractable when seeking to generate large financial market scenarios. To overcome this problem of dimensionality, the authors propose an iterative generation procedure relying on the concept of hierarchies in financial markets. The authors construct an ensemble of GANs that they call the Hierarchical-SigCWGAN, which is based on hierarchical clustering that approximates signatures in the spirit of the original model. The Hierarchical-SigCWGAN can scale to higher dimensions and generate large-dimensional scenarios in which the joint behavior of all the assets in the market is replicated. The model is validated by comparing its performance on a series of similarity metrics with respect to the original SigCWGAN on a dataset in which it is still tractable and by showing its scalability on a larger dataset.
Tackling the Exponential Scaling of Signature-Based Generative Adversarial Networks for High-Dimensional Financial Time-Series Generation. The Journal of Financial Data Science. Pub Date: 2022-09-24. DOI: 10.3905/jfds.2022.1.109 (https://doi.org/10.3905/jfds.2022.1.109)
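For readers unfamiliar with path signatures, the sketch below computes the first two signature levels of a piecewise-linear path and counts how many coordinates a truncated signature has. It is only meant to show why signature features become heavy as the number of assets grows; it is not the SigCWGAN or Hierarchical-SigCWGAN model, and the function names and the example path are hypothetical.

```python
def signature_size(d, depth):
    """Number of coordinates in the signature of a d-dimensional path, levels 1..depth."""
    return sum(d ** k for k in range(1, depth + 1))

def signature_depth2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    Uses Chen's identity segment by segment: appending a linear increment `delta`
    adds s1 (x) delta + 0.5 * delta (x) delta to level 2, and delta to level 1.
    """
    d = len(path[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for prev, curr in zip(path, path[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            s1[i] += delta[i]
    return s1, s2

# A made-up two-dimensional path: move right, then up.
s1, s2 = signature_depth2([(0, 0), (1, 0), (1, 1)])
levy_area = 0.5 * (s2[0][1] - s2[1][0])
print(s1, s2, levy_area)

# Coordinate counts at depth 3 for a small and a larger market.
print(signature_size(2, 3), signature_size(50, 3))
```

Level 1 records the total increment, while level 2 also captures the order in which the coordinates moved, which is what makes signatures informative for time series.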
Separating the side and size of a position allows for sophisticated strategy structures to be developed. Modeling the size component can be done through a meta-labeling approach. This article establishes several heterogeneous architectures to account for key aspects of meta-labeling. They serve as a guide for practitioners in the model development process, as well as for researchers to further build on these ideas. An architecture can be developed through the lens of feature- and/or strategy-driven approaches. The feature-driven approach exploits the way the information in the data is structured and how the selected models use that information, whereas a strategy-driven approach specifically aims to incorporate unique characteristics of the underlying trading strategy. Furthermore, the concept of inverse meta-labeling is introduced as a technique to improve the quantity and quality of the side forecasts.
Meta-Labeling Architecture. M. Meyer, J. Joubert, Mesias Alfeus. The Journal of Financial Data Science. Pub Date: 2022-09-16. DOI: 10.3905/jfds.2022.1.108 (https://doi.org/10.3905/jfds.2022.1.108)
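The separation of side and size mentioned in the abstract above can be illustrated with a few lines. The sketch assumes a primary model that outputs a side in {-1, 0, +1} and a secondary (meta) model that outputs the probability that the primary call is correct; the linear mapping from probability to size is an illustrative choice, not one of the architectures proposed in the article. The function name and threshold are hypothetical.

```python
def position(side, meta_prob, threshold=0.5):
    """Combine a primary side with a meta-model probability into a signed position.

    side      : -1 (short), 0 (flat), or +1 (long) from the primary model.
    meta_prob : secondary model's probability that the primary side is correct.
    Below the threshold no trade is taken; above it, size scales linearly to 1.
    """
    if meta_prob < threshold:
        return 0.0
    size = (meta_prob - threshold) / (1.0 - threshold)
    return side * size

# Made-up signals: (side, meta-model probability).
signals = [(1, 0.75), (-1, 1.0), (1, 0.4)]
print([position(s, p) for s, p in signals])
```

Because the side never changes inside `position`, the secondary model can only filter or scale trades, which keeps the two modeling problems separate.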