However, what happens when you try to go beyond single-image generation and apply these models to long-form content like movie scripts, stories, and podcasts?
For example, let's say you have a script you are trying to illustrate.
A reasonable but naive strategy would be to feed every scene to the models one at a time. In practice, this fails for a number of reasons.
In this blog post, we will discuss 6 vision-language grounding challenges you will encounter when applying these generative systems at scale.
1) Character ambiguity
Consider a scene from a movie script where the protagonist Teddy (28M) is in the bathroom, practicing his engagement proposal speech.
If you wanted to illustrate this scene, feeding it into Midjourney would produce the following:
Or if you fed the scene description into DALL·E, you would generate:
The issue is that character names can be misleading; these models interpret the semantics of the text tokens by leveraging the language priors they have acquired through their training data.
In other words, because the name "Teddy" frequently co-occurs with images of actual teddy bears in the training corpus, the model falls back to that depiction.
The remedy here is to incorporate the additional guidance of the character context into the model's generation, namely that we are dealing with an adult male and not a teddy bear.
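One way to sketch this remedy is to maintain a character "bible" and prepend each character's visual description to any scene prompt that mentions them. The `CHARACTER_BIBLE` dictionary and `build_prompt` helper below are illustrative, not part of any model's API, and the description of Teddy beyond "28-year-old man" is invented for the example:

```python
# Hypothetical character "bible" mapping script names to visual descriptions.
# Only "28M" comes from the script; the rest is an assumed description.
CHARACTER_BIBLE = {
    "Teddy": "Teddy, a 28-year-old man with short dark hair",
}

def build_prompt(scene_description: str, bible: dict) -> str:
    """Prepend a visual description for every known character named in the scene."""
    context = [desc for name, desc in bible.items() if name in scene_description]
    if not context:
        return scene_description
    return "; ".join(context) + ". " + scene_description

prompt = build_prompt(
    "Teddy stands at the bathroom mirror, rehearsing his proposal speech.",
    CHARACTER_BIBLE,
)
```

With the augmented prompt, the model sees "a 28-year-old man" before it ever sees the ambiguous name, which steers its language prior away from the stuffed animal.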
2) Character consistency
Across any two frames, existing generative models are very bad at maintaining consistent physical descriptions for the same characters.
Here are two depictions of a character, Dillon Murphy, in a script across two different scenes. We include the accompanying prompts provided to Stable Diffusion.
Notice that, regardless of how specific the character description we provide to the model is (a 28-year-old man with long brown hair, blue eyes, a square jaw, and a slight beard), it still generates different outputs.
This is problematic when we need the character to have identical appearances across shots, which is often the case in long-form content.
3) Breaking a complex scene into visualizable pieces
Here's an action description from a movie script:
There's a lot going on here. What should a generative model focus on to make for the best shot?
Here's how Stable Diffusion interprets this passage:
This is really off-the-mark because there's just too much going on for the model to know what to depict.
Research has shown that existing generative models struggle to interpret basic language phenomena like counting, spatial relationships, directed actions, and compositionality.
This problem becomes even more pronounced as you apply generative models to denser pieces of text, like passages from books.
4) Carrying over physical context across generations
Generative models have no good mechanisms for persisting physical context effectively.
In one scene from a script, we find ourselves in a doctor's office where a patient is being told a negative prognosis.
The very next action in the scene has the following description:
If you interpret this action in isolation, you have no context about the physical setting. Who's involved in the scene? How are they sitting? What's the tone? How can a model know what to generate here?
Here's Stable Diffusion's interpretation:
Because the model has not carried over the setting information about the doctor's office, it just zeroes in on generating something with a grin.
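A minimal way to sketch a fix is to accumulate a running scene context (setting and participants) and prefix it onto each isolated action before prompting the model. The `SceneContext` class and the specific setting strings below are hypothetical, introduced only to illustrate the idea:

```python
from dataclasses import dataclass, field

@dataclass
class SceneContext:
    """Running physical context carried across every shot in a scene."""
    setting: str = ""
    characters: list = field(default_factory=list)

    def prompt_for(self, action: str) -> str:
        """Prefix an isolated action with the current setting and cast."""
        parts = [self.setting, ", ".join(self.characters), action]
        return ". ".join(p for p in parts if p)

# Assumed context for the doctor's-office scene described above.
ctx = SceneContext(
    setting="Interior, a doctor's office under fluorescent light",
    characters=["a doctor in a white coat", "a patient seated across the desk"],
)
prompt = ctx.prompt_for("He grins.")
```

Now even a two-word action like "He grins." reaches the model wrapped in enough physical context to place the grin in the right room, on the right face.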
5) Capturing implicit emotional context
Here's another action from the script:
Is he happy? Sad? Nostalgic? Pained?
Stable Diffusion generates this depiction:
There's very little emotional depth in this illustration.
As they stand, current generative models cannot emotionally ground characters from linguistic subtext alone.
This problem is even more prominent if a character undergoes emotional transformation across shots in a scene.
6) Stylistic consistency
Here's Stable Diffusion's interpretation of two consecutive scenes both depicting action in the style of the film WALL-E.
Are these really the same visual style?
If you've seen WALL-E you'll notice that while the Stable Diffusion humans are animated, they don't match the style of the film's humans.
They are close but not the same, which introduces a visual uncanny valley.
In general, existing generative models cannot maintain a consistent stylistic aesthetic across generations, even when you prompt them appropriately.
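Prompting appropriately usually means at least appending the same style descriptor to every shot, as in the sketch below. The `STYLE_SUFFIX` string is illustrative, and, as the WALL-E example shows, this mitigates but does not solve stylistic drift:

```python
# Assumed style descriptor, applied uniformly to every shot prompt.
STYLE_SUFFIX = "in the style of a 3D animated Pixar-like film, soft warm lighting"

def styled(prompt: str, suffix: str = STYLE_SUFFIX) -> str:
    """Append the shared style descriptor to a single shot prompt."""
    return f"{prompt}, {suffix}"

shots = [
    "Two robots meet on a garbage heap",
    "Humans drift past in hover chairs",
]
prompts = [styled(s) for s in shots]
```

Keeping the suffix (and, where the API exposes one, the random seed) identical across shots narrows the stylistic spread, but the model can still drift on everything the suffix doesn't pin down.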
While Midjourney, Stable Diffusion, and DALL·E are able to produce expressive single image generations, they struggle when it comes to more complex vision-language grounding.
Contextual understanding is crucial since even if text-to-image models produce the most beautiful illustrations, they can't read your mind.
That's why at Storia AI we're investing heavily in generative AI models and workflows that bring rich visuals to long-form content.
At Storia, we're building the future of AI-driven video solutions with some exciting new products coming out soon. If you get excited about AI and video, we want to hear from you!