One of the central challenges in sentiment analysis and emotion recognition is how to effectively fuse multimodal inputs. Transformer-based models have recently achieved great success in multimodal sentiment analysis and emotion recognition. However, owing to their parallel structure, transformer-based models often neglect the temporal coherence of human emotion. In addition, the low-rank bottleneck induced by multi-head attention limits the fitting ability of such models. To tackle these issues, a Deep Spatiotemporal Interaction Network (DSIN) is proposed in this study. It consists of two main components, i.e., a cross-modal transformer with a cross-talking attention module and a hierarchical temporal fusion module, where the cross-modal transformer models the spatial interactions between different modalities and the hierarchical temporal fusion module models the temporal coherence of emotion. DSIN can therefore capture the spatiotemporal interactions of multimodal inputs by incorporating time dependency into the parallel structure of the transformer, and it reduces the redundancy of the embedded features by injecting their spatiotemporal interactions into a hybrid memory network in a hierarchical manner. Experimental results on two benchmark datasets indicate that DSIN achieves superior performance compared with state-of-the-art models, and some useful insights are derived from the results.
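To make the composition of the two components concrete, the following is a minimal PyTorch sketch of a DSIN-style pipeline: a cross-modal attention block for the spatial interaction between two modality streams, followed by a recurrent layer approximating the temporal fusion stage. The class names (CrossModalBlock, DSINSketch), the use of nn.MultiheadAttention for cross-modal attention, the LSTM standing in for the hybrid memory network, and all dimensions are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: spatial cross-modal interaction followed by
# temporal fusion, loosely mirroring the DSIN description above.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """One direction of cross-modal attention: queries from one modality attend to another."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq: torch.Tensor, kv_seq: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query_seq, kv_seq, kv_seq)
        return self.norm(query_seq + fused)  # residual connection over the query stream


class DSINSketch(nn.Module):
    """Spatial cross-modal interaction, then temporal fusion, then a prediction head."""
    def __init__(self, dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.cross_modal = CrossModalBlock(dim)
        # An LSTM here only approximates the paper's hybrid memory network.
        self.temporal_fusion = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, text_seq: torch.Tensor, audio_seq: torch.Tensor) -> torch.Tensor:
        spatial = self.cross_modal(text_seq, audio_seq)  # (B, T, dim), spatial interaction
        _, (h_n, _) = self.temporal_fusion(spatial)      # sequential pass models time dependency
        return self.head(h_n[-1])                        # (B, num_classes)


# Example usage with random features for two modalities sharing a time axis.
text = torch.randn(8, 20, 64)   # batch of 8, 20 time steps, 64-dim text features
audio = torch.randn(8, 20, 64)  # time-aligned 64-dim audio features
logits = DSINSketch()(text, audio)
print(logits.shape)  # torch.Size([8, 2])
```

The point of the sketch is the ordering of the two stages: the attention block mixes modalities at each time step (spatial interaction), while the recurrent pass over the fused sequence reintroduces the time dependency that a purely parallel transformer would ignore.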