Machine learning-based intrusion detection systems (ML-IDS) for in-vehicle networks require diverse, high-quality datasets, which are scarce because of privacy and data collection challenges. Real-world data collection is resource-intensive and often lacks detailed attack scenarios. This survey examines synthetic data generation (SDG) as a solution and systematically reviews SDG methods, ML-IDS models, and their intersection in automotive security, which has not been addressed in prior surveys. We introduce a quantitative evaluation framework and apply it to synthetic and real datasets, including SynCAN (Synthetic Controller Area Network), CAN-MIRGU (CAN Multi-Information Record Generating Unit), and the Real ORNL (Oak Ridge National Laboratory) Automotive Dynamometer (ROAD) dataset. The results reveal critical limitations: current synthetic approaches show reduced identifier coverage and unrealistic temporal patterns. Additionally, spatial network topology analysis reveals that synthetic datasets lack the hierarchical hub-and-spoke communication structures and functional subsystem coupling characteristic of real vehicular networks. Through a comprehensive analysis of more than 50 papers published between 2018 and 2025, we identify five research gaps: temporal fidelity preservation, real-time constraints, cross-vehicle generalisation, attack diversity limitations, and quality validation requirements. Although SDG promises to address data scarcity and enable complex attack scenario simulations, current methods inadequately model authentic vehicular communications. We provide guidelines for developing temporally aware generation models and validation frameworks for practical deployment.
