{"title":"Anomaly-aware summary statistic from data batches","authors":"Gaia Grosso","doi":"arxiv-2407.01249","DOIUrl":null,"url":null,"abstract":"Signal-agnostic data exploration based on machine learning could unveil very\nsubtle statistical deviations of collider data from the expected Standard Model\nof particle physics. The beneficial impact of a large training sample on\nmachine learning solutions motivates the exploration of increasingly large and\ninclusive samples of acquired data with resource efficient computational\nmethods. In this work we consider the New Physics Learning Machine (NPLM), a\nmultivariate goodness-of-fit test built on the Neyman-Pearson\nmaximum-likelihood-ratio construction, and we address the problem of testing\nlarge size samples under computational and storage resource constraints. We\npropose to perform parallel NPLM routines over batches of the data, and to\ncombine them by locally aggregating over the data-to-reference density ratios\nlearnt by each batch. The resulting data hypothesis defining the\nlikelihood-ratio test is thus shared over the batches, and complies with the\nassumption that the expected rate of new physical processes is time invariant.\nWe show that this method outperforms the simple sum of the independent tests\nrun over the batches, and can recover, or even surpass, the sensitivity of the\nsingle test run over the full data. Beside the significant advantage for the\noffline application of NPLM to large size samples, the proposed approach offers\nnew prospects toward the use of NPLM to construct anomaly-aware summary\nstatistics in quasi-online data streaming scenarios.","PeriodicalId":501065,"journal":{"name":"arXiv - PHYS - Data Analysis, Statistics and Probability","volume":"51 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - PHYS - Data Analysis, Statistics and Probability","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.01249","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Signal-agnostic data exploration based on machine learning could unveil very subtle statistical deviations of collider data from the expected Standard Model of particle physics. The beneficial impact of a large training sample on machine learning solutions motivates the exploration of increasingly large and inclusive samples of acquired data with resource-efficient computational methods. In this work we consider the New Physics Learning Machine (NPLM), a multivariate goodness-of-fit test built on the Neyman-Pearson maximum-likelihood-ratio construction, and we address the problem of testing large samples under computational and storage resource constraints.
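For context, the maximum-likelihood-ratio statistic maximized by NPLM is usually written in the NPLM literature as follows (a sketch; the notation is assumed here, not fixed by the abstract: f is the learned log data-to-reference density ratio, D the data sample, and R the weighted reference sample with per-event weights w_x):

    t(\mathcal{D}) = 2\,\max_{f}\Big[\sum_{x\in\mathcal{D}} f(x) \;-\; \sum_{x\in\mathcal{R}} w_x\,\big(e^{f(x)}-1\big)\Big]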
We propose to perform parallel NPLM routines over batches of the data, and to combine them by locally aggregating the data-to-reference density ratios learnt from each batch. The resulting data hypothesis defining the likelihood-ratio test is thus shared across the batches, and complies with the assumption that the expected rate of new physical processes is time invariant. We show that this method outperforms the simple sum of the independent tests run over the batches, and can recover, or even surpass, the sensitivity of the single test run over the full data. Besides the significant advantage for the offline application of NPLM to large samples, the proposed approach offers new prospects toward the use of NPLM to construct anomaly-aware summary statistics in quasi-online data streaming scenarios.
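To make the batched scheme concrete, below is a minimal Python sketch of one plausible reading of it: each batch yields a learned log density ratio f_b, the per-batch ratios are combined into a single shared ratio, and an extended-likelihood-ratio statistic is evaluated under that shared hypothesis. The function names (aggregate_log_ratio, nplm_statistic), the averaging rule, and the toy inputs are illustrative assumptions, not the paper's exact prescription.

    import numpy as np

    def aggregate_log_ratio(batch_log_ratios):
        """Combine per-batch learned log density ratios f_b(x) into one shared
        hypothesis by averaging in ratio space (an assumed aggregation rule;
        the paper's exact local-aggregation prescription may differ).

        batch_log_ratios: array of shape (B, N), f_b evaluated at N points.
        Returns: array of shape (N,), the aggregated log ratio.
        """
        # r(x) = (1/B) * sum_b exp(f_b(x)), returned as log r(x)
        return np.log(np.mean(np.exp(batch_log_ratios), axis=0))

    def nplm_statistic(f_data, f_ref, w_ref):
        """Extended-likelihood-ratio statistic in the standard NPLM form
        (notation assumed):
        t = 2 * [ sum_{x in D} f(x) - sum_{x in R} w_x * (exp(f(x)) - 1) ].

        f_data: aggregated log ratio evaluated on the data sample D.
        f_ref:  aggregated log ratio evaluated on the reference sample R.
        w_ref:  per-event weights of the reference sample.
        """
        return 2.0 * (f_data.sum() - (w_ref * (np.exp(f_ref) - 1.0)).sum())

    # Hypothetical usage: B models trained independently on B batches, then
    # evaluated on the full data and reference samples before aggregation.
    B, N_data, N_ref = 4, 10_000, 100_000
    rng = np.random.default_rng(0)
    f_b_data = rng.normal(0.0, 0.05, size=(B, N_data))  # stand-ins for trained models
    f_b_ref = rng.normal(0.0, 0.05, size=(B, N_ref))
    f_data = aggregate_log_ratio(f_b_data)
    f_ref = aggregate_log_ratio(f_b_ref)
    w_ref = np.full(N_ref, N_data / N_ref)  # weights matching the expected yield
    t = nplm_statistic(f_data, f_ref, w_ref)
    print(f"aggregated NPLM-style statistic: t = {t:.2f}")

Aggregating in ratio space, rather than summing per-batch test statistics, keeps a single alternative hypothesis shared across all batches; this is consistent with the time-invariance assumption stated in the abstract and with the reported gain over the naive sum of independent per-batch tests.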