Moritz Stang, Bastian Krämer, Marcelo Cajias, Wolfgang Schäfers
DOI: 10.1080/08965803.2023.2258012 (https://doi.org/10.1080/08965803.2023.2258012)
Published: 2023-09-29
Changing the Location Game – Improving Location Analytics with the Help of Explainable AI
Abstract

Besides its structural and economic characteristics, the location of a property is probably one of the most important determinants of its underlying value. In contrast to property valuations, there are hardly any approaches to date that evaluate the quality of a real estate location in an automated manner. The reasons are the complexity, the number of interactions and the non-linearities underlying the quality specifications of a certain location. By combining a state-of-the-art machine learning algorithm with the local, post-hoc, model-agnostic method of Shapley Additive Explanations (SHAP), this paper introduces a newly developed approach, the SHAP location score (SHAP-LS), that is able to detect these complexities and enables real estate locations to be assessed in a data-based manner. The SHAP location score is an intuitive and flexible approach based on econometric modeling techniques and the basic assumptions of hedonic pricing theory. The approach can be applied post-hoc to any common machine learning method and can be flexibly adapted to the respective needs. This constitutes a significant extension of traditional urban models and offers many advantages for a wide range of real estate players.

Keywords: Location Analytics; Explainable AI; Machine Learning; Shapley Values; Automated Location Valuation Model

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Notes

1. This term describes the fact that the technique is applied after the actual training of an algorithm (post-hoc) and can be applied to different algorithms (model-agnostic).

2. In the context of the SHAP-LS methodology, it is in principle possible to use either purchase or rental prices. Both reflect the observable willingness to pay for a property with certain characteristics and a certain location, and can thus be used interchangeably in this logic.

3. An example of the identification of aggregated results would be the Permutation Feature Importance (see e.g., Krämer, Nagl, et al., 2023).

4. Theoretically, the SHAP-LS of single features could be used for this kind of analysis. However, this is not recommended, as one is exposed to the capriciousness of the algorithms and data providers. Locational features such as the distance to the next bus stop and to the next subway station are often highly correlated. Consequently, the algorithm cannot distinguish perfectly between these correlated features, which can blur the individual SHAP values. Another reason that should not be neglected is the dependence on how the data providers categorize the location characteristics. In some cases, individual amenities overlap considerably, e.g., the classification of restaurants, pubs or bars. Combining several individual characteristics into categories can counteract this blurring. As a rule of thumb, the more data are available, the smaller the categories that can be used.

5. The overall and categorical SHAP-LSs can also be used without scaling, providing absolute values instead of relative comparisons. This alternative application of the SHAP-LS methodology, the absolute SHAP-LS, yields the absolute contributions of location-specific features, expressed in the unit of the property's price. It gives a quantifiable measure of the value that market participants place on the location characteristics compared to the average prediction, thus providing an additional perspective for analysis. It also broadens the scope for data interpretation, offering new avenues of investigation in real estate valuation and location quality assessment. An exemplary implementation can be found in Appendix VI.

6. A further application in other asset classes, such as retail, industrial, hotel, or office, is also theoretically possible but beyond the scope of this study.

7. Since the SHAP-LS methodology is used to analyze the extent of influence of individual location-descriptive features, there is no need to transform the negative POIs. No statement is made or needed beforehand regarding whether a greater distance to these POIs generally has a positive or negative impact on the quality of a location. The term "negative POIs" is chosen simply because they are generally perceived as negative by humans. In terms of logic and use, these POIs are therefore treated no differently from the other POIs.

8. To obtain SHAP values of the features of interest, the shap package (https://shap.readthedocs.io/en/latest/index.html) is used.

9. To ensure the reliability of the results obtained from the newly introduced SHAP-LS across various model specifications, hyperparameter choices, and other potential variations, two distinct robustness checks are performed. The results can be seen in Appendix III.

10. To verify that the SHAP-LS does not merely reflect the rental price per square meter, the correlation between the SHAP-LS and the rent per square meter is calculated. A coefficient of 0.58 indicates that the SHAP-LS captures the relative attractiveness of a location rather than relying solely on prices. A discernible correlation between the appeal of a location and the rental rates is to be expected.

11. A corresponding example can be seen in Appendix IV.

12. Another potential application for practitioners would be to integrate the SHAP-LS into an automated real estate valuation process to further improve valuation accuracy. An empirical example can be found in Appendix V.

13. As outlined in the methodology section, the SHAP-LS technique can be employed for location analysis without scaling to gain further insights. An illustrative example of this approach, which we call absolute SHAP-LS, can be found in Appendix VI.

14. In the context of the SHAP-LS, the correlation between different features is only a minor concern because the results are presented at an aggregated overall or group level. The challenges of interpreting specific features individually are thus avoided.

15. Compared to the empirical example in the main part of the study, this illustrative demonstration uses data not only from the year 2020 but also from the years 2021 and 2022. This allows a moving window approach to be implemented and enables more robust testing of the results.

16. To use the SHAP-LSs in the context of automated real estate valuation, they need to be computed beforehand. To enable a fair comparison, the SHAP-LSs for the test data are calculated out-of-sample. In the first step, following the logic outlined in the methodology section, the SHAP-LSs are calculated for a training dataset. In the second step, the SHAP-LSs for the unseen test data are determined using a k-nearest neighbors approach (k = 5) based on the previously computed SHAP-LSs of the training data. Our approach can therefore be seen as a feature selection and feature aggregation method, as it systematically identifies the most relevant locational variables and incorporates them into the model. By concentrating on key features, our methodology reduces the risk of overfitting, leading to models that potentially generalize better to new data.

17. The NUTS (Nomenclature of territorial units for statistics) classification is a hierarchical system for dividing up the economic territory of the EU and the UK. Overall, there are four different subdivision levels, called NUTS-0, NUTS-1, NUTS-2 and NUTS-3. The NUTS-3 regions in the UK constitute local administrative units including counties, unitary authorities and London boroughs. For a more detailed description of the NUTS regions, we refer to Krämer, Stang, et al. (2023).

18. The same logic is applied to calculate the SHAP-LSs.
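The two ideas described in notes 5 and 16 (an unscaled, absolute SHAP-LS and a k-nearest-neighbors transfer of training scores to unseen observations) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The feature names, the use of planar coordinates to find neighbors, the Euclidean distance, and the plain averaging of neighbor scores are all assumptions made for illustration; the source states only that k = 5 and that the test scores are derived from previously computed training scores.

```python
import math


def absolute_shap_ls(shap_row, location_features):
    """Sum the SHAP values of the location features for one observation.

    shap_row maps feature name -> SHAP value (in price units).
    Non-location features (e.g. living area) are ignored.
    """
    return sum(shap_row[name] for name in location_features)


def knn_shap_ls(train_coords, train_scores, query, k=5):
    """Assign a score to an unseen observation by averaging the scores of
    its k nearest training observations.

    Assumption: neighbors are found by Euclidean distance between
    (x, y) coordinates; the source does not specify the neighbor space.
    """
    ranked = sorted(
        zip(train_coords, train_scores),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest = ranked[:k]
    return sum(score for _, score in nearest) / len(nearest)


# Hypothetical SHAP values for one listing (names are illustrative only).
shap_row = {
    "dist_bus_stop": 0.5,
    "dist_subway": 0.25,
    "dist_restaurant": -0.25,
    "living_area": 1.5,  # structural feature, excluded from the score
}
location_features = ["dist_bus_stop", "dist_subway", "dist_restaurant"]
score = absolute_shap_ls(shap_row, location_features)

# Hypothetical training observations with already computed scores.
train_coords = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2), (10, 10)]
train_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
test_score = knn_shap_ls(train_coords, train_scores, (0.5, 0.5), k=5)
```

In this toy data the distant observation at (10, 10) falls outside the five nearest neighbors, so its extreme score does not influence the value assigned to the query point. Because the test score depends only on training scores, no information from the test prices enters the calculation, which is the out-of-sample property note 16 asks for.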