Large-Scale Pretraining and Finetuning for Efficient Jet Classification in Particle Physics

Zihan Zhao, Farouk Mokhtar, Raghav Kansal, Haoyang Li, Javier Duarte
{"title":"Large-Scale Pretraining and Finetuning for Efficient Jet Classification in Particle Physics","authors":"Zihan Zhao, Farouk Mokhtar, Raghav Kansal, Haoyang Li, Javier Duarte","doi":"arxiv-2408.09343","DOIUrl":null,"url":null,"abstract":"This study introduces an innovative approach to analyzing unlabeled data in\nhigh-energy physics (HEP) through the application of self-supervised learning\n(SSL). Faced with the increasing computational cost of producing high-quality\nlabeled simulation samples at the CERN LHC, we propose leveraging large volumes\nof unlabeled data to overcome the limitations of supervised learning methods,\nwhich heavily rely on detailed labeled simulations. By pretraining models on\nthese vast, mostly untapped datasets, we aim to learn generic representations\nthat can be finetuned with smaller quantities of labeled data. Our methodology\nemploys contrastive learning with augmentations on jet datasets to teach the\nmodel to recognize common representations of jets, addressing the unique\nchallenges of LHC physics. Building on the groundwork laid by previous studies,\nour work demonstrates the critical ability of SSL to utilize large-scale\nunlabeled data effectively. We showcase the scalability and effectiveness of\nour models by gradually increasing the size of the pretraining dataset and\nassessing the resultant performance enhancements. Our results, obtained from\nexperiments on two datasets -- JetClass, representing unlabeled data, and Top\nTagging, serving as labeled simulation data -- show significant improvements in\ndata efficiency, computational efficiency, and overall performance. These\nfindings suggest that SSL can greatly enhance the adaptability of ML models to\nthe HEP domain. This work opens new avenues for the use of unlabeled data in\nHEP and contributes to a better understanding the potential of SSL for\nscientific discovery.","PeriodicalId":501065,"journal":{"name":"arXiv - PHYS - Data Analysis, Statistics and Probability","volume":"47 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - PHYS - Data Analysis, Statistics and Probability","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.09343","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This study introduces an innovative approach to analyzing unlabeled data in high-energy physics (HEP) through the application of self-supervised learning (SSL). Faced with the increasing computational cost of producing high-quality labeled simulation samples at the CERN LHC, we propose leveraging large volumes of unlabeled data to overcome the limitations of supervised learning methods, which heavily rely on detailed labeled simulations. By pretraining models on these vast, mostly untapped datasets, we aim to learn generic representations that can be finetuned with smaller quantities of labeled data. Our methodology employs contrastive learning with augmentations on jet datasets to teach the model to recognize common representations of jets, addressing the unique challenges of LHC physics. Building on the groundwork laid by previous studies, our work demonstrates the critical ability of SSL to utilize large-scale unlabeled data effectively. We showcase the scalability and effectiveness of our models by gradually increasing the size of the pretraining dataset and assessing the resultant performance enhancements. Our results, obtained from experiments on two datasets, JetClass (representing unlabeled data) and Top Tagging (serving as labeled simulation data), show significant improvements in data efficiency, computational efficiency, and overall performance. These findings suggest that SSL can greatly enhance the adaptability of ML models to the HEP domain. This work opens new avenues for the use of unlabeled data in HEP and contributes to a better understanding of the potential of SSL for scientific discovery.
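The abstract describes a two-stage pipeline: contrastive pretraining of a jet encoder on large unlabeled datasets, followed by finetuning on a smaller labeled sample. The sketch below illustrates one way such a pipeline might look in PyTorch, using a SimCLR-style NT-Xent contrastive loss. It is a minimal illustration, not the authors' implementation: the `JetEncoder` architecture, the `augment` function, and the random stand-in datasets are all hypothetical placeholders, since the abstract does not specify the actual model or augmentations.

```python
# Hedged sketch of contrastive pretraining + finetuning for jet classification.
# All model/dataset names below are illustrative stand-ins, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JetEncoder(nn.Module):
    """Toy permutation-invariant jet encoder: per-particle MLP + mean pooling."""
    def __init__(self, n_features=4, dim=128):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(n_features, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x):               # x: (batch, n_particles, n_features)
        return self.phi(x).mean(dim=1)  # -> (batch, dim)

def augment(jets):
    """Hypothetical augmentation: Gaussian smearing of particle features,
    standing in for physically motivated jet augmentations."""
    return jets + 0.02 * torch.randn_like(jets)

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent loss: each jet's positive is its other augmented view."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, dim)
    sim = z @ z.t() / temperature                        # cosine similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float("-inf"))           # exclude self-pairs
    b = z1.size(0)
    # Positive for row i is row i+B (and vice versa for the second half).
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets.to(sim.device))

# Stage 1: contrastive pretraining on (stand-in) unlabeled jets.
encoder = JetEncoder()
projector = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 64))
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(projector.parameters()), lr=1e-3
)
unlabeled = torch.randn(512, 30, 4)  # placeholder for a large unlabeled dataset
for step in range(100):
    batch = unlabeled[torch.randint(0, 512, (128,))]
    z1 = projector(encoder(augment(batch)))  # two independently augmented views
    z2 = projector(encoder(augment(batch)))
    loss = nt_xent_loss(z1, z2)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: finetune a small classification head on (stand-in) labeled jets.
head = nn.Linear(128, 2)  # e.g. top-quark jet vs. QCD background
ft_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-4
)
labeled_x = torch.randn(256, 30, 4)          # placeholder small labeled sample
labeled_y = torch.randint(0, 2, (256,))
for step in range(50):
    loss = F.cross_entropy(head(encoder(labeled_x)), labeled_y)
    ft_opt.zero_grad(); loss.backward(); ft_opt.step()
```

The design choice worth noting is that the projector is used only during pretraining, while the encoder's representation is what gets finetuned; using a lower learning rate in stage 2 is a common way to preserve the pretrained representation when the labeled sample is small.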