{"title":"绘画与音乐的桥梁 -- 通过绘画探索基于情感的音乐创作","authors":"Tanisha Hisariya, Huan Zhang, Jinhua Liang","doi":"arxiv-2409.07827","DOIUrl":null,"url":null,"abstract":"Rapid advancements in artificial intelligence have significantly enhanced\ngenerative tasks involving music and images, employing both unimodal and\nmultimodal approaches. This research develops a model capable of generating\nmusic that resonates with the emotions depicted in visual arts, integrating\nemotion labeling, image captioning, and language models to transform visual\ninputs into musical compositions. Addressing the scarcity of aligned art and\nmusic data, we curated the Emotion Painting Music Dataset, pairing paintings\nwith corresponding music for effective training and evaluation. Our dual-stage\nframework converts images to text descriptions of emotional content and then\ntransforms these descriptions into music, facilitating efficient learning with\nminimal data. Performance is evaluated using metrics such as Fr\\'echet Audio\nDistance (FAD), Total Harmonic Distortion (THD), Inception Score (IS), and KL\ndivergence, with audio-emotion text similarity confirmed by the pre-trained\nCLAP model to demonstrate high alignment between generated music and text. This\nsynthesis tool bridges visual art and music, enhancing accessibility for the\nvisually impaired and opening avenues in educational and therapeutic\napplications by providing enriched multi-sensory experiences.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings\",\"authors\":\"Tanisha Hisariya, Huan Zhang, Jinhua Liang\",\"doi\":\"arxiv-2409.07827\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Rapid advancements in artificial intelligence have significantly enhanced\\ngenerative tasks involving music and images, employing both unimodal and\\nmultimodal approaches. This research develops a model capable of generating\\nmusic that resonates with the emotions depicted in visual arts, integrating\\nemotion labeling, image captioning, and language models to transform visual\\ninputs into musical compositions. Addressing the scarcity of aligned art and\\nmusic data, we curated the Emotion Painting Music Dataset, pairing paintings\\nwith corresponding music for effective training and evaluation. Our dual-stage\\nframework converts images to text descriptions of emotional content and then\\ntransforms these descriptions into music, facilitating efficient learning with\\nminimal data. Performance is evaluated using metrics such as Fr\\\\'echet Audio\\nDistance (FAD), Total Harmonic Distortion (THD), Inception Score (IS), and KL\\ndivergence, with audio-emotion text similarity confirmed by the pre-trained\\nCLAP model to demonstrate high alignment between generated music and text. 
This\\nsynthesis tool bridges visual art and music, enhancing accessibility for the\\nvisually impaired and opening avenues in educational and therapeutic\\napplications by providing enriched multi-sensory experiences.\",\"PeriodicalId\":501284,\"journal\":{\"name\":\"arXiv - EE - Audio and Speech Processing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - EE - Audio and Speech Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07827\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07827","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings
Rapid advancements in artificial intelligence have significantly enhanced generative tasks involving music and images, employing both unimodal and multimodal approaches. This research develops a model capable of generating music that resonates with the emotions depicted in visual art, integrating emotion labeling, image captioning, and language models to transform visual inputs into musical compositions. Addressing the scarcity of aligned art and music data, we curated the Emotion Painting Music Dataset, which pairs paintings with corresponding music for effective training and evaluation.
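To make the pairing concrete, here is a minimal loading sketch for such an aligned dataset. The file layout, manifest fields, and the `EmotionPaintingMusicDataset` class are illustrative assumptions, not the authors' released code.

```python
# Hypothetical loader for painting--music pairs; layout and names are assumed.
import json
from pathlib import Path

import torchaudio
from PIL import Image
from torch.utils.data import Dataset


class EmotionPaintingMusicDataset(Dataset):
    """Pairs each painting with its matched music clip and emotion label."""

    def __init__(self, root: str):
        # Assumed manifest format: one JSON record per pair, e.g.
        # {"image": "paintings/0001.jpg", "audio": "music/0001.wav",
        #  "emotion": "serene"}
        self.root = Path(root)
        with open(self.root / "pairs.json") as f:
            self.pairs = json.load(f)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        rec = self.pairs[idx]
        image = Image.open(self.root / rec["image"]).convert("RGB")
        waveform, sample_rate = torchaudio.load(str(self.root / rec["audio"]))
        return {"image": image, "audio": waveform,
                "sample_rate": sample_rate, "emotion": rec["emotion"]}
```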
Our dual-stage framework first converts an image into a text description of its emotional content and then transforms that description into music, enabling efficient learning from minimal data.
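As a rough illustration of the dual-stage idea, the sketch below chains an off-the-shelf image-captioning model into an off-the-shelf text-to-music model. The specific checkpoints (BLIP via transformers, MusicGen via audiocraft) are stand-ins chosen for illustration; the abstract does not name the models actually used.

```python
# Two-stage sketch: image -> emotional text description -> music.
# BLIP and MusicGen are illustrative stand-ins, not the paper's models.
import torchaudio
from audiocraft.models import MusicGen
from transformers import pipeline

# Stage 1: describe the painting in text.
captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")
caption = captioner("painting.jpg")[0]["generated_text"]
prompt = f"music evoking the mood of: {caption}"

# Stage 2: turn the description into music.
music_model = MusicGen.get_pretrained("facebook/musicgen-small")
music_model.set_generation_params(duration=10)  # seconds of audio
wav = music_model.generate([prompt])            # (batch, channels, samples)

torchaudio.save("generated.wav", wav[0].cpu(), music_model.sample_rate)
```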
Performance is evaluated using metrics such as Fréchet Audio Distance (FAD), Total Harmonic Distortion (THD), Inception Score (IS), and KL divergence, and audio-text emotion similarity scores from the pre-trained CLAP model confirm high alignment between the generated music and its text description.
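For readers unfamiliar with these metrics, the sketch below shows how FAD and a CLAP-style audio-text similarity can be computed once embeddings are in hand; the embedding extraction itself (an audio encoder for FAD, CLAP encoders for similarity) is assumed to be available and is not shown.

```python
# Metric sketches: Fréchet Audio Distance between embedding sets, and
# cosine similarity between CLAP audio and text embeddings.
# Embeddings are assumed precomputed; extraction code is not shown.
import numpy as np
from scipy import linalg


def frechet_audio_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """FAD between Gaussians fit to real and generated embeddings (N, D)."""
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # discard tiny imaginary numerical noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))


def clap_similarity(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between one audio and one text embedding."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(a @ t)
```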
This synthesis tool bridges visual art and music, enhancing accessibility for the visually impaired and opening avenues in educational and therapeutic applications by providing enriched multi-sensory experiences.