Recent breakthroughs in generative AI have markedly elevated the realism and controllability of synthetic media. In the visual modality, long-context attention mechanisms and diffusion-style refinements now deliver videos with superior temporal consistency, spatial coherence, and high-resolution detail. These techniques underpin an expanding set of applications ranging from text-guided storyboarding and animation to engineering visualization and virtual prototyping. In the audio modality, token-based representations combined with hierarchical decoding enable the direct production of faithful speech, music, and ambient sound from textual prompts, powering rapid voice-over creation, personalized music, and immersive soundscapes. The frontier is shifting toward unified audio-visual pipelines that synchronize imagery with dialog, sound effects, and ambience, promising end-to-end tooling for applications in education, simulation, entertainment, and accessible content production. This review surveys these advances across the audio and video modalities and outlines future research directions focused on improving generation efficiency, cross-modal coherence, and controllability.
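As a minimal, self-contained sketch (not drawn from the reviewed paper) of the diffusion-style refinement loop the abstract refers to, the snippet below applies a single DDPM-style reverse step repeatedly to denoise a stand-in frame. The noise predictor `predict_noise`, the variance schedule, and the frame shape are all illustrative assumptions standing in for a trained video model.

```python
import numpy as np

def predict_noise(x_t, t):
    # Hypothetical stand-in for a learned denoiser (in practice a neural network).
    return np.zeros_like(x_t)

def ddpm_reverse_step(x_t, t, alphas, alpha_bars, rng):
    # Standard DDPM update: estimate the noise, remove it, then re-add scaled noise.
    eps = predict_noise(x_t, t)
    alpha_t, alpha_bar_t = alphas[t], alpha_bars[t]
    mean = (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_t)
    if t == 0:
        return mean
    sigma_t = np.sqrt(1.0 - alpha_t)  # one common, simple choice of step variance
    return mean + sigma_t * rng.standard_normal(x_t.shape)

# Usage: denoise a random stand-in "frame" over T reverse steps.
T = 10
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
rng = np.random.default_rng(0)
frame = rng.standard_normal((8, 8, 3))
for t in reversed(range(T)):
    frame = ddpm_reverse_step(frame, t, alphas, alpha_bars, rng)
```

Video diffusion models extend this per-frame update with temporal (long-context) attention across frames, which is what allows successive denoised frames to remain consistent over time.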
{"title":"Artificial Intelligence in Multimedia Content Generation: A Review of Audio and Video Synthesis Techniques","authors":"Charles Ding, Rohan Bhowmik","doi":"10.1002/jsid.2111","DOIUrl":"https://doi.org/10.1002/jsid.2111","url":null,"abstract":"<p>Recent breakthroughs in generative AI have markedly elevated the realism and controllability of synthetic media. In the visual modality, long-context attention mechanisms and diffusion-style refinements now deliver videos with superior temporal consistency, spatial coherence, and high-resolution detail. These techniques underpin an expanding set of applications ranging from text-guided storyboarding and animation to engineering visualization and virtual prototyping. In the audio modality, token-based representations combined with hierarchical decoding enable the direct production of faithful speech, music, and ambient sound from textual prompts, powering rapid voice-over creation, personalized music, and immersive soundscapes. The frontier is shifting toward unified audio–visual pipelines that synchronize imagery with dialog, sound effects, and ambience, promising end-to-end tooling for a wide variety of applications such as education, simulation, entertainment, and accessible content production. This review surveys these advances across modalities and outlines future research directions focused on improving generation efficiency, coherence, and controllability across modalities.</p>","PeriodicalId":49979,"journal":{"name":"Journal of the Society for Information Display","volume":"34 2","pages":"49-67"},"PeriodicalIF":2.2,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sid.onlinelibrary.wiley.com/doi/epdf/10.1002/jsid.2111","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146139231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}