How to improve the statistical power of the 10-fold cross validation scheme in recommender systems
A. Košir, Ante Odic, M. Tkalcic
DOI: 10.1145/2532508.2532510

At the current stage of development of recommender systems (RS), evaluating competing approaches (methods) that yield similar performance in reproduced experiments is crucial for steering further development in the most promising direction. These comparisons are usually based on the 10-fold cross validation scheme. Since the compared performances are often close to each other, statistical significance testing is indispensable in order not to be misled by performance differences that arise purely by chance. For the same reason, when reproducing experiments on a different set of experimental data, the most powerful applicable significance test should be used. In this work we provide guidelines on how to achieve the highest power when comparing RS, and we demonstrate them on a comparison of RS performances when different variables are contextualized.
Offline evaluation of recommender systems: all pain and no gain?
M. Levy
DOI: 10.1145/2532508.2532509

A large-scale offline evaluation -- with a big money prize attached -- established recommender systems as a niche discipline worth researching, and one where robust and reproducible experiments would be easy. Since then, however, critiques within academia have shown up shortcomings in the most appealingly objective evaluation metrics, war stories from the commercial front line have suggested that the correlation between offline metrics and bottom-line gains in production may be non-existent, and several subsequent academic competitions have come under fierce criticism from both advisors and participants. In this talk I will draw on practical experience at Last.fm and Mendeley, as well as insights from others, to offer some opinions about offline evaluation of recommender systems: whether we still need it at all, what value we can hope to draw from it, how best to do it if we have to, and how to make the experience less painful than it is right now.
A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation
J. Beel, Marcel Genzmehr, Stefan Langer, A. Nürnberger, Bela Gipp
DOI: 10.1145/2532508.2532511

Offline evaluations are the most common evaluation method for research paper recommender systems. However, no thorough discussion of the appropriateness of offline evaluations has taken place, despite some voiced criticism. We conducted a study in which we evaluated various recommendation approaches with both offline and online evaluations. We found that the results of offline and online evaluations often contradict each other. We discuss this finding in detail and conclude that, in many settings, offline evaluations may be inappropriate for evaluating research paper recommender systems.
Research paper recommender system evaluation: a quantitative literature survey
Joeran Beel, Stefan Langer, Marcel Genzmehr, Bela Gipp, Corinna Breitinger, A. Nürnberger
DOI: 10.1145/2532508.2532512

Over 80 approaches for academic literature recommendation exist today. These approaches were introduced and evaluated in more than 170 research articles, as well as patents, presentations and blogs. We reviewed these approaches and found that most evaluations contain major shortcomings. Of the approaches proposed, 21% were not evaluated at all. Among the evaluated approaches, 19% were not compared against a baseline. Of the user studies performed, 60% had 15 or fewer participants or did not report the number of participants. Information on runtime and coverage was rarely provided. Due to these and several other shortcomings described in this paper, we conclude that it is currently not possible to determine which recommendation approaches for academic literature are the most promising. Yet there is little value in the existence of more than 80 approaches if the best performing ones remain unknown.
Toward identification and adoption of best practices in algorithmic recommender systems research
J. Konstan, G. Adomavicius
DOI: 10.1145/2532508.2532513

One of the goals of data-intensive research, in any field of study, is to grow knowledge over time as additional studies contribute to collective knowledge and understanding. Two steps are critical to making such research cumulative: individual research results must be documented thoroughly and based on data made available to others (to allow replication and meta-analysis), and the individual research must be carried out correctly, following standards and best practices for coding, missing data, algorithm choices, algorithm implementations, metrics, and statistics. This work addresses a growing concern that the Recommender Systems research community (which is uniquely equipped to address many important challenges in electronic commerce, social networks, social media, and big-data settings) is facing a crisis in which a significant number of research papers lack the rigor and evaluation needed to judge them properly and, therefore, have little to contribute to collective knowledge. We advocate addressing this issue through the development and dissemination (to authors, reviewers, and editors) of best-practice research methodologies, resulting in specific guidelines and checklists, as well as through tool development to support effective research. We also plan to assess the impact on the field with an eye toward supporting such efforts in other data-intensive specialties.