If you picture a caped dog flying through the clouds, or an astronaut riding a horse on Mars, you might think you are having a fever dream.
But these surreal images exist outside of a sleepy daze: you can now access them on your computer.
They were created by Meta’s world-class algorithms that can turn any text into a (slightly) realistic video. Last month, Meta used these surreal clips to introduce the world to its AI text-to-video generator, Make-A-Video.
Just a few days later, Google showed two AI video generators: Imagen Video and Phenaki. These models are designed to convert text descriptions into short video clips. The longest clips are from Phenaki and last up to several minutes.
While these modern marvels are not yet publicly available, they could change the way we make art forever. So far, the models have received both criticism and praise from AI and art experts.
“The big commercial advances are pretty amazing, even for experts,” says David Bau, a computer scientist at Northeastern University.
How Models Make Videos From Scratch
This isn’t the first case of AI-powered video manipulation. In recent years, several startups have figured out how to adjust lip movements so that they sync with audio. It’s also possible to swap people’s faces, for instance to trick viewers into believing a celebrity appeared in a movie they were never actually involved in.
And now the new text-to-video models can fuse unrelated concepts (like knitting, music, and pandas) and create a stunning end product (like a panda knitting on a couch while bobbing its head to the music).
“This ability to create novel compositions from such a wide range of visual concepts is new,” says Bau. “That’s what’s so amazing.”
Make-A-Video and Imagen Video belong to a group of machine learning models called diffusion models.
Engineers train a diffusion model by showing it a captioned image or video that has been degraded with noise and asking the model to sharpen it. Over many examples, the model learns to predict which images the words in a caption are likely to correspond to.
To train a model at the scale needed for Make-A-Video and Imagen Video, engineers had it work through hundreds of millions of images.
Eventually, the models were able to handle prompts. Like a sculptor slowly carving away a chunk of rock, a diffusion model shapes an image or video over hundreds or even thousands of passes, cleaning up pixels here, coloring a shape there, reshaping an edge elsewhere. In the end, what remains is a reasonably convincing clip.
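The pass-by-pass refinement described above can be illustrated with a toy sketch. This is not how Make-A-Video or Imagen Video is implemented. The five-value "image", the `toy_denoiser` stand-in, and the `steps` and `step_size` parameters are all assumptions made for illustration. A real diffusion model uses a trained neural network, guided by the text prompt, to estimate the noise; here the stand-in simply cheats by looking at a known target.

```python
import random


def toy_denoiser(noisy, target):
    """Stand-in for a trained network: estimate the noise in `noisy`.

    A real model would predict this from the pixels and the text prompt.
    This hypothetical version cheats and computes it from a known target.
    """
    return [n - t for n, t in zip(noisy, target)]


def generate(target, steps=200, step_size=0.05, seed=0):
    """Start from random noise and refine it a little on every pass."""
    rng = random.Random(seed)
    # Begin with pure Gaussian noise, one value per "pixel".
    sample = [rng.gauss(0.0, 1.0) for _ in target]
    for _ in range(steps):
        predicted_noise = toy_denoiser(sample, target)
        # Remove a small fraction of the predicted noise on each pass.
        sample = [s - step_size * e for s, e in zip(sample, predicted_noise)]
    return sample


def mean_abs_error(a, b):
    """Average absolute difference between two equally long lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


target = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical 5-pixel "image"
rough = generate(target, steps=5)      # only a few passes: still noisy
refined = generate(target, steps=200)  # many passes: close to the target

print("error after 5 passes:  ", mean_abs_error(rough, target))
print("error after 200 passes:", mean_abs_error(refined, target))
```

Each pass removes only a small share of the estimated noise, which is why the result after a handful of passes is still rough while the result after a couple of hundred is close to the target. In a real system the hard part is the denoiser itself, which has to be learned from those hundreds of millions of captioned examples.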
Phenaki’s creators also showed it millions of images and videos with accompanying text, but Phenaki additionally learned which words in the text were important. That means it can take, say, a paragraph-sized narrative, break it down into a series of events, and turn that series into a film of whatever length Phenaki deems appropriate.
Falling flat (for now)
Of course, the process is far from perfect. For one thing, text-to-video models are only as good as their (massive) training data sets, and it’s difficult to separate data from the biases of the people who created them.
The creators of Imagen Video acknowledged that their model is capable of producing racist or sexist content. Google won’t be releasing it to the public until the company addresses “several important safety and ethical challenges,” according to the preprint.
The creators noted that Imagen Video can create videos that are “fake, hateful, explicit, or harmful” and that it is currently difficult to detect and filter out this type of content.
Meta has stayed quiet on the issue, but a spokesperson said the company “will continue to explore ways to further refine and mitigate potential risks.”
And these models are unlikely to impress working artists. Compared to the work of human artists, even the hand-picked computer-generated demonstration clips fall flat. But that could change.
Real world applications
As text-to-video AI improves, some experts believe the stock video industry could particularly benefit. “Why would you pay a lot of money…to license an image or a video when you could just create it on the spot?” says Henry Ajder, an AI scientist and consultant who researches deepfakes.
Hollywood could also benefit from generative models. Filmmakers could use them, for example, to imagine what an actor might look like in a certain role, or to plan scenes before shooting them.
Eventually, text-to-video could excel at spitting out specific products — say, banner ads or simple animations in video games. For artists, this could pose a major dilemma.
After all, some art jobs are already being handed over to AI image generators, says Julian Merkle, a video game concept artist at Beffio Studio.
In addition, the convergence of art and AI raises unanswered questions. Who actually owns the rights to an AI-generated work of art? And what happens when models create media “in the style” of existing artists, which some are already doing?
“There was no consensus before the rise of AI and there probably won’t be one after, although people seem to attack AI faster than artists,” says Merkle. “I think that’s a problem with our copyright system.”
On the other hand, text-to-video could put more power in the hands of individuals. Imagine a single person who wants to develop a video game.
It’s an uphill battle these days: independent game developers have to be skilled enough to create art, animation, cutscenes and text. But in the future, one person could do all of this using generative models.
That’s all far on the horizon. Currently, large video generators are essentially locked jewels displayed in corporate showcases. That means researchers can’t investigate the mysterious workings of the models.
“To me, the combination of these two things seems like a fragile and undesirable situation,” says Bau.
An uncertain future
But the engineers behind other AI art generators tend to be far more transparent. The text generator GPT-3, developed by the San Francisco-based lab OpenAI, can write a poem or summarize a text, among other things. Although it is not free to use, GPT-3 spawned a wave of open-source language models.
And earlier this year, the DALL-E 2 image generator met its open-source match with Stable Diffusion.
So anyone with a powerful enough computer (a good gaming rig will do) could download and tinker with Stable Diffusion or some video generation equivalent. However, this removes the filters that prevent the proprietary models from spawning objectionable content.
This could open the floodgates for more creative and dangerous deepfakes. Bullies could produce footage that targets classmates, or abusers could produce explicit clips of their former partners. “If everyone can use it, it’s not just celebrities that get attacked,” says Ajder. “For me, moderation is the biggest challenge with these tools.”
So while we may soon be entering an age of easily accessible deepfakes, we probably have a few more years to prevent dangerous consequences. “We’re a long way from hyper-realistic, not authentic, content on video,” says Ajder.