Negar Arabzadeh, A. Bigdeli, Shirin Seyedsalehi, Morteza Zihayat, E. Bagheri
{"title":"天作之合:监督查询重构的工具箱和大规模数据集","authors":"Negar Arabzadeh, A. Bigdeli, Shirin Seyedsalehi, Morteza Zihayat, E. Bagheri","doi":"10.1145/3459637.3482009","DOIUrl":null,"url":null,"abstract":"Researchers have already shown that it is possible to improve retrieval effectiveness through the systematic reformulation of users' queries. Traditionally, most query reformulation techniques relied on unsupervised approaches such as query expansion through pseudo-relevance feedback. More recently and with the increasing effectiveness of neural sequence-to-sequence architectures, the problem of query reformulation has been studied as a supervised query translation problem, which learns to rewrite a query into a more effective alternative. While quite effective in practice, such supervised query reformulation methods require a large number of training instances. In this paper, we present three large-scale query reformulation datasets, namely Diamond, Platinum and Gold datasets, based on the queries in the MS MARCO dataset. The Diamond dataset consists of over 188,000 query pairs where the original source query is matched with an alternative query that has a perfect retrieval effectiveness (an average precision of 1). To the best of our knowledge, this is the first set of datasets for supervised query reformulation that offers perfect query reformulations for a large number of queries. The implementation of our fully automated tool, which is based on a transformer architecture, and our three datasets are made publicly available. We also establish a neural query reformulation baseline performance on our datasets by reporting the performance of strong neural query reformulation baselines. It is our belief that our datasets will significantly impact the development of supervised query reformulation methods in the future.","PeriodicalId":405296,"journal":{"name":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","volume":"85 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Matches Made in Heaven: Toolkit and Large-Scale Datasets for Supervised Query Reformulation\",\"authors\":\"Negar Arabzadeh, A. Bigdeli, Shirin Seyedsalehi, Morteza Zihayat, E. Bagheri\",\"doi\":\"10.1145/3459637.3482009\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Researchers have already shown that it is possible to improve retrieval effectiveness through the systematic reformulation of users' queries. Traditionally, most query reformulation techniques relied on unsupervised approaches such as query expansion through pseudo-relevance feedback. More recently and with the increasing effectiveness of neural sequence-to-sequence architectures, the problem of query reformulation has been studied as a supervised query translation problem, which learns to rewrite a query into a more effective alternative. While quite effective in practice, such supervised query reformulation methods require a large number of training instances. In this paper, we present three large-scale query reformulation datasets, namely Diamond, Platinum and Gold datasets, based on the queries in the MS MARCO dataset. The Diamond dataset consists of over 188,000 query pairs where the original source query is matched with an alternative query that has a perfect retrieval effectiveness (an average precision of 1). To the best of our knowledge, this is the first set of datasets for supervised query reformulation that offers perfect query reformulations for a large number of queries. The implementation of our fully automated tool, which is based on a transformer architecture, and our three datasets are made publicly available. We also establish a neural query reformulation baseline performance on our datasets by reporting the performance of strong neural query reformulation baselines. It is our belief that our datasets will significantly impact the development of supervised query reformulation methods in the future.\",\"PeriodicalId\":405296,\"journal\":{\"name\":\"Proceedings of the 30th ACM International Conference on Information & Knowledge Management\",\"volume\":\"85 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-10-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 30th ACM International Conference on Information & Knowledge Management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3459637.3482009\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3459637.3482009","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Matches Made in Heaven: Toolkit and Large-Scale Datasets for Supervised Query Reformulation
Researchers have already shown that it is possible to improve retrieval effectiveness through the systematic reformulation of users' queries. Traditionally, most query reformulation techniques relied on unsupervised approaches such as query expansion through pseudo-relevance feedback. More recently and with the increasing effectiveness of neural sequence-to-sequence architectures, the problem of query reformulation has been studied as a supervised query translation problem, which learns to rewrite a query into a more effective alternative. While quite effective in practice, such supervised query reformulation methods require a large number of training instances. In this paper, we present three large-scale query reformulation datasets, namely Diamond, Platinum and Gold datasets, based on the queries in the MS MARCO dataset. The Diamond dataset consists of over 188,000 query pairs where the original source query is matched with an alternative query that has a perfect retrieval effectiveness (an average precision of 1). To the best of our knowledge, this is the first set of datasets for supervised query reformulation that offers perfect query reformulations for a large number of queries. The implementation of our fully automated tool, which is based on a transformer architecture, and our three datasets are made publicly available. We also establish a neural query reformulation baseline performance on our datasets by reporting the performance of strong neural query reformulation baselines. It is our belief that our datasets will significantly impact the development of supervised query reformulation methods in the future.