The present study investigates the feasibility of inducing visual perceptual learning on a peripheral, global direction discrimination and integration task in virtual reality, and tests whether audio-visual multisensory training induces faster or greater visual learning than unisensory visual training. Seventeen participants completed a 10-day training experiment wherein they repeatedly performed a 4-alternative, combined visual global-motion and direction discrimination task at 10° azimuth/elevation in a virtual environment. A visual-only group of 8 participants was trained using a unimodal visual stimulus. An audio-visual group of 9 participants underwent training whereby the visual stimulus was always paired with a pulsed, white-noise auditory cue that simulated auditory motion in a direction consistent with the horizontal component of the visual motion stimulus. Our results reveal that, for both groups, learning occurred and transferred to untrained locations. For the AV group, there was an additional performance benefit to training from the AV cue to horizontal motion. This benefit extended into the unisensory post-test, where the auditory cue was removed. However, this benefit did not generalize spatially to previously untrained areas. This spatial specificity suggests that AV learning may have occurred at a lower level in the visual pathways, compared to visual-only learning.