Miggi Zwicklbauer, W. Lamm, Martin Gordon, Konstantinos Apostolidis, Basil Philipp, V. Mezaris
This paper presents a method to interactively create new Sandmännchen stories. We built an application, deployed on a smart speaker, that interacts with a user, selects appropriate segments from a database of Sandmännchen episodes, and combines them to generate a new story compatible with the user's requests. The underlying video analysis technologies are presented and evaluated. We additionally showcase example results from using the complete application as a proof of concept.
{"title":"Video Analysis for Interactive Story Creation: The Sandmännchen Showcase","authors":"Miggi Zwicklbauer, W. Lamm, Martin Gordon, Konstantinos Apostolidis, Basil Philipp, V. Mezaris","doi":"10.1145/3422839.3423061","DOIUrl":"https://doi.org/10.1145/3422839.3423061","url":null,"abstract":"This paper presents a method to interactively create new Sandmännchen stories. We built an application, deployed on a smart speaker, that interacts with a user, selects appropriate segments from a database of Sandmännchen episodes, and combines them to generate a new story compatible with the user's requests. The underlying video analysis technologies are presented and evaluated. We additionally showcase example results from using the complete application as a proof of concept.","PeriodicalId":270338,"journal":{"name":"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132727825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
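The segment selection and recombination described in the abstract above can be illustrated with a minimal sketch. The `Segment` schema, the narrative roles, and the concept-matching rule below are assumptions for illustration only, not the paper's actual pipeline:

```python
from dataclasses import dataclass


@dataclass
class Segment:
    episode: str
    start: float           # segment start time in seconds
    end: float             # segment end time in seconds
    concepts: frozenset    # visual concepts annotated on the segment
    role: str              # hypothetical narrative role: "intro", "main", "outro"


def build_story(segments, requested_concepts):
    """Assemble a playlist: one intro, all main segments matching the
    user's requested concepts, and one outro."""
    requested = frozenset(requested_concepts)
    intro = next(s for s in segments if s.role == "intro")
    outro = next(s for s in segments if s.role == "outro")
    mains = [s for s in segments if s.role == "main" and requested & s.concepts]
    return [intro, *mains, outro]
```

A real system would additionally score segments for visual continuity; this sketch only shows the concept-based filtering step.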
Syeda Maryam Fatima, Marina Shehzad, Syed Sami Murtuza, S. S. Raza
This paper demonstrates CNN-based neural style transfer on an audio dataset to make storytelling a personalized experience: users record a few sentences, which are used to mimic their voice. User audio is converted to spectrograms, whose style is transferred to the spectrogram of a base voice narrating the story, analogous to style transfer on images. This approach stands out because it needs only a small dataset and therefore also takes less time to train the model. The project is intended specifically for children who prefer digital interaction and are increasingly leaving the storytelling culture behind, and for working parents who are unable to spend enough time with their children. By using a parent's initial recording to narrate a given story, it is designed to serve as a bridge between storytelling and screen time, engaging children through the implicit ethical themes of the stories and connecting them to their loved ones while ensuring an innocuous and meaningful learning experience.
{"title":"Neural Style Transfer Based Voice Mimicking for Personalized Audio Stories","authors":"Syeda Maryam Fatima, Marina Shehzad, Syed Sami Murtuza, S. S. Raza","doi":"10.1145/3422839.3423063","DOIUrl":"https://doi.org/10.1145/3422839.3423063","url":null,"abstract":"This paper demonstrates CNN-based neural style transfer on an audio dataset to make storytelling a personalized experience: users record a few sentences, which are used to mimic their voice. User audio is converted to spectrograms, whose style is transferred to the spectrogram of a base voice narrating the story, analogous to style transfer on images. This approach stands out because it needs only a small dataset and therefore also takes less time to train the model. The project is intended specifically for children who prefer digital interaction and are increasingly leaving the storytelling culture behind, and for working parents who are unable to spend enough time with their children. By using a parent's initial recording to narrate a given story, it is designed to serve as a bridge between storytelling and screen time, engaging children through the implicit ethical themes of the stories and connecting them to their loved ones while ensuring an innocuous and meaningful learning experience.","PeriodicalId":270338,"journal":{"name":"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129940240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
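The spectrogram representation that lets image-style transfer techniques operate on audio can be sketched as follows. The STFT parameters and the Gram-matrix style statistic (the standard "style" feature in image style transfer) are illustrative assumptions; the paper's actual network and losses are not reproduced here:

```python
import numpy as np


def spectrogram(audio, n_fft=512, hop=128):
    """Log-magnitude STFT spectrogram, treated as a 2-D 'image'
    (frequency bins x time frames)."""
    window = np.hanning(n_fft)
    frames = [audio[i:i + n_fft] * window
              for i in range(0, len(audio) - n_fft, hop)]
    stft = np.fft.rfft(np.asarray(frames), axis=1)
    return np.log1p(np.abs(stft)).T


def gram_matrix(spec):
    """Gram matrix over spectrogram rows -- the second-order statistic
    a style loss would match between the user's voice and the base voice."""
    flat = spec - spec.mean()
    return flat @ flat.T / flat.size
```

In a Gatys-style pipeline, the generated spectrogram would be optimized so its Gram matrix approaches the user's while its content features stay close to the base narration, after which audio is recovered by inverting the spectrogram.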
Humans perform intelligent tasks by productively leveraging relevant information from numerous sensory and experiential inputs, and recent scientific and hardware advances have made it increasingly possible for machines to attempt this as well. However, improved resource availability does not automatically give rise to humanlike performance in complex tasks [1]. In this talk, I discuss recent work towards three tasks that benefit from an elegant synthesis of linguistic and visual input: visual storytelling, visual question answering (VQA), and affective content analysis. I focus primarily on visual storytelling, a burgeoning task with the goal of generating coherent, sensible narratives for sequences of input images [2]. I analyze recent work in this area, and then introduce a novel visual storytelling approach that employs a hierarchical context-based network, with a co-attention mechanism that jointly attends to patterns in visual (image) and linguistic (description) input. Following this, I describe ongoing work in VQA, another inherently multimodal task with the goal of producing accurate, sensible answers to questions about images. I explore a formulation in which the VQA model generates unconstrained, free-form text, providing preliminary evidence that harnessing the linguistic patterns latent in language models results in competitive task performance [3]. Finally, I introduce some intriguing new work that investigates the utility of linguistic patterns in a task that is not inherently multimodal: analyzing the affective content of images. I close by suggesting some exciting future directions for each of these tasks as they pertain to multimodal media analysis.
{"title":"And, Action! Towards Leveraging Multimodal Patterns for Storytelling and Content Analysis","authors":"Natalie Parde","doi":"10.1145/3422839.3423060","DOIUrl":"https://doi.org/10.1145/3422839.3423060","url":null,"abstract":"Humans perform intelligent tasks by productively leveraging relevant information from numerous sensory and experiential inputs, and recent scientific and hardware advances have made it increasingly possible for machines to attempt this as well. However, improved resource availability does not automatically give rise to humanlike performance in complex tasks [1]. In this talk, I discuss recent work towards three tasks that benefit from an elegant synthesis of linguistic and visual input: visual storytelling, visual question answering (VQA), and affective content analysis. I focus primarily on visual storytelling, a burgeoning task with the goal of generating coherent, sensible narratives for sequences of input images [2]. I analyze recent work in this area, and then introduce a novel visual storytelling approach that employs a hierarchical context-based network, with a co-attention mechanism that jointly attends to patterns in visual (image) and linguistic (description) input. Following this, I describe ongoing work in VQA, another inherently multimodal task with the goal of producing accurate, sensible answers to questions about images. I explore a formulation in which the VQA model generates unconstrained, free-form text, providing preliminary evidence that harnessing the linguistic patterns latent in language models results in competitive task performance [3]. Finally, I introduce some intriguing new work that investigates the utility of linguistic patterns in a task that is not inherently multimodal: analyzing the affective content of images. I close by suggesting some exciting future directions for each of these tasks as they pertain to multimodal media analysis.","PeriodicalId":270338,"journal":{"name":"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130408925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 1: Video Analytics and Storytelling","authors":"V. Mezaris","doi":"10.1145/3429509","DOIUrl":"https://doi.org/10.1145/3429509","url":null,"abstract":"","PeriodicalId":270338,"journal":{"name":"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130934480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Keynote & Invited Talks","authors":"Raphael Troncy","doi":"10.1145/3429508","DOIUrl":"https://doi.org/10.1145/3429508","url":null,"abstract":"","PeriodicalId":270338,"journal":{"name":"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124532180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we present a Bidirectional LSTM neural network with a Conditional Random Field layer on top, which utilizes word, character, and morph embeddings to perform named entity recognition on various Finnish datasets. To overcome the lack of annotated training corpora that arises when dealing with low-resource languages like Finnish, we tried a knowledge-transfer technique to transfer tags from an Estonian dataset. On the human-annotated, in-domain Digitoday dataset, our system achieved an F1 score of 84.73. On the out-of-domain Wikipedia set, it achieved an F1 score of 67.66. To see how well the system performs on speech data, we used two datasets containing automatic speech recognition outputs. Since we do not have true labels for those datasets, we used a rule-based system to annotate them and used those annotations as reference labels. On the first dataset, which contains Finnish parliament sessions, we obtained an F1 score of 42.09, and on the second, which contains talks from Yle Pressiklubi, we obtained an F1 score of 74.54.
{"title":"Named Entity Recognition for Spoken Finnish","authors":"Dejan Porjazovski, Juho Leinonen, M. Kurimo","doi":"10.1145/3422839.3423066","DOIUrl":"https://doi.org/10.1145/3422839.3423066","url":null,"abstract":"In this paper we present a Bidirectional LSTM neural network with a Conditional Random Field layer on top, which utilizes word, character, and morph embeddings to perform named entity recognition on various Finnish datasets. To overcome the lack of annotated training corpora that arises when dealing with low-resource languages like Finnish, we tried a knowledge-transfer technique to transfer tags from an Estonian dataset. On the human-annotated, in-domain Digitoday dataset, our system achieved an F1 score of 84.73. On the out-of-domain Wikipedia set, it achieved an F1 score of 67.66. To see how well the system performs on speech data, we used two datasets containing automatic speech recognition outputs. Since we do not have true labels for those datasets, we used a rule-based system to annotate them and used those annotations as reference labels. On the first dataset, which contains Finnish parliament sessions, we obtained an F1 score of 42.09, and on the second, which contains talks from Yle Pressiklubi, we obtained an F1 score of 74.54.","PeriodicalId":270338,"journal":{"name":"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126729978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
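F1 scores like those reported above are conventionally computed at the entity level, where a prediction counts only if both span and type match exactly. A minimal sketch, assuming entities are represented as `(start, end, type)` tuples (an illustrative representation, not the paper's evaluation code):

```python
def entity_f1(gold, pred):
    """Entity-level F1: an entity is a true positive only when its
    span boundaries and its type both match a gold entity exactly."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

This strict matching explains why noisy ASR transcripts (which distort entity words and boundaries) can depress the score so sharply, as in the parliament-session result.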
{"title":"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery","authors":"","doi":"10.1145/3422839","DOIUrl":"https://doi.org/10.1145/3422839","url":null,"abstract":"","PeriodicalId":270338,"journal":{"name":"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122249481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TV broadcasters and other organizations with online media collections conduct digital marketing activities to extend the reach of, and engagement with, their media assets. Marketing success depends on the relevance of the media content's topics to the audience, which is even harder to ensure when planning future marketing activities, since one needs to know which topics the future audience will be interested in. This paper presents an innovative application of AI-based predictive analytics that identifies the topics likely to be more popular among future audiences, and its use in the digital content marketing strategy of media organisations.
{"title":"Predicting Your Future Audience's Popular Topics to Optimize TV Content Marketing Success","authors":"L. Nixon","doi":"10.1145/3422839.3423062","DOIUrl":"https://doi.org/10.1145/3422839.3423062","url":null,"abstract":"TV broadcasters and other organizations with online media collections conduct digital marketing activities to extend the reach of, and engagement with, their media assets. Marketing success depends on the relevance of the media content's topics to the audience, which is even harder to ensure when planning future marketing activities, since one needs to know which topics the future audience will be interested in. This paper presents an innovative application of AI-based predictive analytics that identifies the topics likely to be more popular among future audiences, and its use in the digital content marketing strategy of media organisations.","PeriodicalId":270338,"journal":{"name":"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125678491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 2: Video Annotation and Summarization","authors":"Jorma T. Laaksonen","doi":"10.1145/3429510","DOIUrl":"https://doi.org/10.1145/3429510","url":null,"abstract":"","PeriodicalId":270338,"journal":{"name":"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133623506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yashaswi Rauthan, Vatsala Singh, Rishabh Agrawal, Satej Kadlay, N. Pedanekar, Shirish S. Karande, Manasi Malik, Iaphi Tariang
Crisis situations often require authorities to convey important messages to a large population of varying demographics. An example of such a message is "maintain a distance of 6 ft from others" during the present COVID-19 crisis. In this paper, we propose a method to programmatically place such messages in existing entertainment media as overlays at semantically relevant locations. For this purpose, we use generic semantic annotations on the media and subsequent spatio-temporal querying on these annotations to find candidate locations for message placement. We then propose choosing the final locations optimally using parameters such as the spacing of messages, the length of the messages, and the confidence of query results. We present preliminary results for optimal placement of messages in popular entertainment media.
{"title":"Avoid Crowding in the Battlefield: Semantic Placement of Social Messages in Entertainment Programs","authors":"Yashaswi Rauthan, Vatsala Singh, Rishabh Agrawal, Satej Kadlay, N. Pedanekar, Shirish S. Karande, Manasi Malik, Iaphi Tariang","doi":"10.1145/3422839.3423065","DOIUrl":"https://doi.org/10.1145/3422839.3423065","url":null,"abstract":"Crisis situations often require authorities to convey important messages to a large population of varying demographics. An example of such a message is \"maintain a distance of 6 ft from others\" during the present COVID-19 crisis. In this paper, we propose a method to programmatically place such messages in existing entertainment media as overlays at semantically relevant locations. For this purpose, we use generic semantic annotations on the media and subsequent spatio-temporal querying on these annotations to find candidate locations for message placement. We then propose choosing the final locations optimally using parameters such as the spacing of messages, the length of the messages, and the confidence of query results. We present preliminary results for optimal placement of messages in popular entertainment media.","PeriodicalId":270338,"journal":{"name":"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131552246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
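The final step described in the abstract above, choosing placement locations from scored candidates under a spacing constraint, can be sketched with a simple greedy heuristic. The `(time, confidence)` candidate representation and the greedy strategy are assumptions for illustration, not the paper's actual optimization:

```python
def place_messages(candidates, min_gap, max_messages):
    """Greedily pick high-confidence candidate timestamps while keeping
    every chosen pair at least `min_gap` seconds apart.

    candidates: list of (time_seconds, confidence) tuples from the
    spatio-temporal queries; returns the chosen tuples in time order.
    """
    chosen = []
    for t, conf in sorted(candidates, key=lambda c: -c[1]):
        if len(chosen) == max_messages:
            break
        if all(abs(t - u) >= min_gap for u, _ in chosen):
            chosen.append((t, conf))
    return sorted(chosen)
```

A true optimal placement would also weigh message length and could be solved exactly (e.g. by dynamic programming over candidates sorted by time); the greedy pass above just illustrates how the spacing and confidence parameters interact.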