This paper describes a three-dimensional (3D) modeling method for sequentially and spatially understanding situations in unknown environments from an image sequence acquired by a camera. The proposed method divides the image sequence chronologically into sub-image sequences based on the number of images, generates a local 3D model from each sub-image sequence using Structure from Motion and Multi-View Stereo (SfM–MVS), and integrates the local models. Each sub-image sequence partially overlaps the preceding and following sub-image sequences. The local 3D models are integrated into a single 3D model using transformation parameters computed from the camera trajectories estimated by SfM–MVS. In our experiments on three real data sets acquired by a camera, we quantitatively compared the integrated models with a 3D model generated from all images in a single batch, in terms of both model quality and the computational time needed to obtain the models. The results show that the proposed method generates an integrated model whose quality is comparable to that of the 3D model generated by SfM–MVS from all images in a batch, while reducing the computational time.
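The chronological division into partially overlapping sub-image sequences can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `split_with_overlap` and the parameters `chunk_size` and `overlap` are hypothetical, and the sketch assumes a constant number of images per sub-sequence and a constant overlap, which the abstract does not specify.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_with_overlap(images: Sequence[T], chunk_size: int, overlap: int) -> List[List[T]]:
    """Split a chronologically ordered sequence into overlapping chunks.

    Assumption (not from the paper): every chunk holds `chunk_size` items and
    shares `overlap` items with the previous chunk; the last chunk may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # number of new images introduced per chunk
    chunks: List[List[T]] = []
    start = 0
    while start < len(images):
        chunks.append(list(images[start:start + chunk_size]))
        if start + chunk_size >= len(images):
            break  # this chunk already reaches the end of the sequence
        start += step
    return chunks


# Hypothetical usage: ten frame indices, four per chunk, two shared with the neighbor.
frames = list(range(10))
sub_sequences = split_with_overlap(frames, chunk_size=4, overlap=2)
```

In such a scheme, the images shared by neighboring sub-sequences are reconstructed in both local models, so their estimated camera poses give correspondences from which a transformation between the two local coordinate frames could be computed.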