Modern multimodal learning often requires handling heterogeneous data types whose structures and information densities differ substantially. To address this challenge in the context of metabolic dysfunction-associated fatty liver disease (MAFLD) prediction, we propose an information-density-aware multimodal framework (NID-Net). Instead of relying on simple concatenation or shallow fusion, the model processes each modality using methods that align with its structural characteristics. Structured indicators with high information density are first processed by an XGBoost module optimized via Lagrange remainder correction, which enhances the nonlinearity of the loss landscape and improves robustness to data sparsity and imbalance. Meanwhile, tongue images with relatively low information density are encoded using a region-enhanced Swin Transformer, where adaptive regional biases guide the model toward informative local representations. The resulting modality-specific embeddings are fused within a Mixture-of-Experts (MoE) architecture, enabling selective specialization and nonlinear decision boundaries across modalities. Extensive experiments on real-world medical datasets demonstrate that NID-Net not only surpasses existing multimodal fusion approaches in predictive performance but also provides interpretable insights into cross-modal feature interactions. This work highlights the fundamental role of nonlinear design in achieving efficient, balanced, and explainable multimodal prediction systems.
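The Mixture-of-Experts fusion described above can be sketched in a few lines. This is a minimal, illustrative NumPy sketch, not the authors' implementation: the dimensions, weight matrices, and the use of plain linear experts are all assumptions, standing in for the XGBoost-derived and Swin-Transformer-derived embeddings and the gated expert fusion the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper)
D_TAB, D_IMG, D_FUSED, N_EXPERTS = 16, 32, 8, 4

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for the modality-specific embeddings: structured clinical
# indicators (high information density) and tongue-image features
# (lower information density)
z_tab = rng.standard_normal(D_TAB)
z_img = rng.standard_normal(D_IMG)
z = np.concatenate([z_tab, z_img])   # joint input to the MoE layer

# Gating network: softmax weights over experts, conditioned on both
# modalities, enabling selective specialization
W_gate = rng.standard_normal((N_EXPERTS, z.size))
gate = softmax(W_gate @ z)           # expert weights, sum to 1

# Experts: here simple linear maps; each can specialize on a different
# cross-modal interaction pattern
W_exp = rng.standard_normal((N_EXPERTS, D_FUSED, z.size))
expert_out = W_exp @ z               # shape (N_EXPERTS, D_FUSED)

# Fused embedding: gate-weighted mixture of expert outputs
fused = gate @ expert_out            # shape (D_FUSED,)
```

Because the gate is a softmax over expert scores, the fusion is a convex combination of expert outputs, which is one way such an architecture yields nonlinear, input-dependent decision boundaries across modalities.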
