Anshuman Tripathi, Gursharanjit Kaur, Abhirup Datta and Suman Majumdar
Title: Comparing sampling techniques to chart parameter space of 21 cm global signal with Artificial Neural Networks
Journal: Journal of Cosmology and Astroparticle Physics
Published: 2024-10-14
DOI: 10.1088/1475-7516/2024/10/041 (https://doi.org/10.1088/1475-7516/2024/10/041)
Abstract
Understanding the first billion years of the universe requires studying two critical epochs: the Epoch of Reionization (EoR) and Cosmic Dawn (CD). However, due to limited data, the properties of the Intergalactic Medium (IGM) during these periods remain poorly understood, leading to a vast parameter space for the global 21 cm signal. Training an Artificial Neural Network (ANN) on a narrowly defined parameter space can result in biased inferences. To mitigate this, the training dataset must be drawn uniformly from the entire parameter space so that it covers all possible signal realizations. However, generating all possible realizations is computationally challenging, so a representative subset of this space must be sampled instead. This study aims to identify optimal sampling techniques for the high dimensionality and large volume of the 21 cm signal parameter space. The optimally sampled training set will be used to train the ANN to perform inference on data from the global signal experiment. We investigate three sampling techniques: random, Latin hypercube (stratified), and Hammersley sequence (quasi-Monte Carlo) sampling, and compare their outcomes. Our findings reveal that sufficient samples must be drawn for robust and accurate ANN model training, regardless of the sampling technique employed. The required sample size depends primarily on two factors: the complexity of the data and the number of free parameters. More free parameters require more realizations. Among the sampling techniques utilized, we find that ANN models trained with Hammersley sequence sampling are more robust than those trained with Latin hypercube or random sampling.
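To make the three sampling schemes named in the abstract concrete, here is a minimal sketch in pure Python that draws points in the unit hypercube with each of them and then rescales the points to parameter bounds. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of successive primes as Hammersley bases, and the bounds in the usage note are all hypothetical, and the abstract does not list the actual 21 cm model parameters or their ranges.

```python
import random

# First few primes, used as bases for the Hammersley coordinates (assumed choice).
PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29)


def radical_inverse(index, base):
    """Van der Corput radical inverse: mirror the base-`base` digits of
    `index` about the radix point, giving a value in [0, 1)."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def random_sample(n, dim, rng):
    """Plain random sampling: n independent uniform points in [0, 1)^dim."""
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


def latin_hypercube_sample(n, dim, rng):
    """Latin hypercube sampling: along every axis, each of the n equal-width
    strata contains exactly one point."""
    columns = []
    for _ in range(dim):
        # One jittered point per stratum, then shuffle the stratum order
        # so the axes are paired at random.
        column = [(k + rng.random()) / n for k in range(n)]
        rng.shuffle(column)
        columns.append(column)
    return [[columns[d][i] for d in range(dim)] for i in range(n)]


def hammersley_sample(n, dim):
    """Hammersley set: first coordinate is i/n, the remaining coordinates are
    radical inverses of i in successive prime bases. Deterministic."""
    if dim - 1 > len(PRIMES):
        raise ValueError("not enough prime bases for this dimension")
    return [
        [i / n] + [radical_inverse(i, PRIMES[d]) for d in range(dim - 1)]
        for i in range(n)
    ]


def scale_to_bounds(points, bounds):
    """Map unit-cube points to physical ranges given as (low, high) pairs."""
    return [
        [low + u * (high - low) for u, (low, high) in zip(point, bounds)]
        for point in points
    ]
```

As a usage example, `scale_to_bounds(hammersley_sample(1000, 3), [(0.0, 1.0), (10.0, 50.0), (-2.0, 2.0)])` would produce 1000 three-parameter training points; these bounds are placeholders, not values from the paper. Unlike the other two schemes, the Hammersley set takes no random generator, so the same `n` and `dim` always yield the same points.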
About the journal:
Journal of Cosmology and Astroparticle Physics (JCAP) encompasses theoretical, observational and experimental areas as well as computation and simulation. The journal covers the latest developments in the theory of all fundamental interactions and their cosmological implications (e.g. M-theory and cosmology, brane cosmology). JCAP's coverage also includes topics such as formation, dynamics and clustering of galaxies, pre-galactic star formation, X-ray astronomy, radio astronomy, gravitational lensing, active galactic nuclei, intergalactic and interstellar matter.