Large-Scale Pretraining and Finetuning for Efficient Jet Classification in Particle Physics
Zihan Zhao, Farouk Mokhtar, Raghav Kansal, Haoyang Li, Javier Duarte
arXiv:2408.09343 (Data Analysis, Statistics and Probability), 2024-08-18
Abstract
This study introduces an innovative approach to analyzing unlabeled data in high-energy physics (HEP) through the application of self-supervised learning (SSL). Faced with the increasing computational cost of producing high-quality labeled simulation samples at the CERN LHC, we propose leveraging large volumes of unlabeled data to overcome the limitations of supervised learning methods, which rely heavily on detailed labeled simulations. By pretraining models on these vast, mostly untapped datasets, we aim to learn generic representations that can be finetuned with smaller quantities of labeled data. Our methodology employs contrastive learning with augmentations on jet datasets to teach the model to recognize common representations of jets, addressing the unique challenges of LHC physics. Building on the groundwork laid by previous studies, our work demonstrates the critical ability of SSL to utilize large-scale unlabeled data effectively. We showcase the scalability and effectiveness of our models by gradually increasing the size of the pretraining dataset and assessing the resulting performance enhancements. Our results, obtained from experiments on two datasets, JetClass (representing unlabeled data) and Top Tagging (serving as labeled simulation data), show significant improvements in data efficiency, computational efficiency, and overall performance. These findings suggest that SSL can greatly enhance the adaptability of ML models to the HEP domain. This work opens new avenues for the use of unlabeled data in HEP and contributes to a better understanding of the potential of SSL for scientific discovery.
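
To make the contrastive pretraining step concrete, the sketch below shows a SimCLR-style setup on jet constituents: two augmented views of each jet (here, a random rotation in the eta-phi plane and mild pT smearing, assumed to be label-preserving) are embedded by a shared permutation-invariant encoder and pulled together by an NT-Xent loss. The encoder architecture, augmentation choices, and hyperparameters are illustrative assumptions, not the configuration used in the paper.

```python
# A minimal sketch of SimCLR-style contrastive pretraining on jet constituents.
# The encoder, augmentations, and hyperparameters are illustrative placeholders,
# not the architecture or settings used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


def augment(jets):
    """Return an augmented view of a batch of jets.

    `jets` has shape (batch, n_constituents, 3) with features (pT, eta, phi).
    A random rotation in the eta-phi plane plus mild pT smearing is used here
    as an assumed label-preserving augmentation.
    """
    pt, eta, phi = jets.unbind(dim=-1)
    angle = torch.rand(jets.size(0), 1, device=jets.device) * 2 * torch.pi
    eta_rot = eta * torch.cos(angle) - phi * torch.sin(angle)
    phi_rot = eta * torch.sin(angle) + phi * torch.cos(angle)
    pt_smeared = pt * (1.0 + 0.05 * torch.randn_like(pt))
    return torch.stack([pt_smeared, eta_rot, phi_rot], dim=-1)


class JetEncoder(nn.Module):
    """Permutation-invariant (Deep Sets style) encoder mapping a jet to an embedding."""

    def __init__(self, in_dim=3, hidden=128, out_dim=64):
        super().__init__()
        self.per_particle = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden), nn.ReLU()
        )
        self.readout = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, jets):
        # Summing over constituents makes the representation permutation invariant.
        return self.readout(self.per_particle(jets).sum(dim=1))


def nt_xent(z1, z2, temperature=0.1):
    """NT-Xent contrastive loss: each view's positive is the other view of the same jet."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
    sim = z @ z.T / temperature
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))  # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    encoder = JetEncoder()
    optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
    jets = torch.randn(256, 30, 3)  # stand-in for a batch of unlabeled JetClass jets
    for step in range(10):
        z1, z2 = encoder(augment(jets)), encoder(augment(jets))
        loss = nt_xent(z1, z2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In such a scheme, the pretrained encoder would subsequently be finetuned, together with a small classification head, on a limited labeled sample such as Top Tagging.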