a16z: The "Pixar" of a New Era, How Will AI Integrate Movies and Games?

Deep Tide TechFlow
2024-08-30 17:53:20
The Pixar of the next century will not be born through traditional film or animation, but through interactive video.

Original Title: “The Next Generation Pixar: How AI will Merge Film & Games”

Author: Jonathan Lai

Compiled by: Deep Tide TechFlow

Over the past century, technological revolutions have given rise to many of our favorite stories. Take the 1930s, when Disney invented the multiplane camera and produced the first full-color animation with synchronized sound. These technological breakthroughs led to the groundbreaking animated feature "Snow White and the Seven Dwarfs."

In the 1940s, Marvel and DC Comics rose to prominence in what is now called the "Golden Age of Comics," thanks to the widespread adoption of the four-color rotary press and lithographic printing, which allowed comics to be printed at scale. The limitations of that technology—low resolution, a narrow tonal range, and halftone printing on cheap newsprint—produced the iconic "pulp" look we still recognize today.

Similarly, Pixar was uniquely positioned in the 1980s to leverage a new technology platform: computers and 3D graphics. Co-founder Edwin Catmull was an early researcher at the NYIT Computer Graphics Lab and Lucasfilm, where he pioneered foundational CGI concepts that eventually led to "Toy Story," the first fully computer-generated animated feature film. Pixar's rendering suite, RenderMan, has since been used in over 500 films.

In each wave of technological change, early prototypes that begin as novelties gradually evolve into new formats for deep storytelling, led by successive generations of new creators. Today, we believe the next Pixar is about to be born. Generative AI is driving a fundamental shift in creative storytelling, enabling a new generation of human creators to tell stories in entirely new ways.

Specifically, we believe the next Pixar will not emerge through traditional film or animation, but through interactive video. This new narrative format will blur the lines between video games and television/film—merging deep storytelling with audience agency and "play," opening up a vast new market.

Gaming: The Frontier of Modern Narrative

Today, two major waves are emerging that could accelerate the formation of the next generation of narrative companies:

  1. The shift of consumers towards interactive media (as opposed to linear/passive media, i.e., television/film)

  2. Technological advancements driven by generative AI

Over the past 30 years, consumers have shifted steadily toward interactive media, with games growing more popular with each generation. For Gen Z and younger, gaming is now the preferred way to spend leisure time, ahead of television and film. In 2019, Netflix CEO Reed Hastings wrote in a letter to shareholders that "we compete with (and lose to) Fortnite more than HBO." In many households, the question is no longer "What are we watching?" but "What are we playing?"

While television, film, and books still tell compelling stories, many of the most innovative and successful new stories are now being told in games. Take "Harry Potter," for example. The open-world role-playing game "Hogwarts Legacy" immerses players in the experience of being a new student at Hogwarts like never before. It was a best-seller in 2023, generating over 1 billion dollars in revenue at launch and surpassing the box office of every "Harry Potter" film except the final one, "Harry Potter and the Deathly Hallows – Part 2" (1.03 billion dollars).

Game intellectual properties (IPs) have also seen tremendous success in recent television and film adaptations. HBO's adaptation of Naughty Dog's "The Last of Us" became HBO Max's most-watched series of 2023, averaging 32 million viewers per episode. "The Super Mario Bros. Movie" grossed 1.4 billion dollars, setting the record for the biggest global opening weekend for an animated film. There are also the critically acclaimed "Fallout" series, Paramount's Halo series, Tom Holland's Uncharted film, and Michael Bay's Skibidi Toilet film—among many others.

One key reason interactive media is so powerful is that active participation helps build intimacy with the story or universe. An hour of playing a game captures attention far more than an hour of passively watching television. Many games are also social, incorporating multiplayer mechanisms into their core design. The most memorable stories are often those we co-create and share with friends and family.

Audiences interact with IPs continuously across multiple media (watching, playing, creating, sharing), making stories not just entertainment but part of personal identity. A magical transformation occurs when someone evolves from a mere "Harry Potter" viewer into a dedicated Potterhead—a far more enduring relationship that builds identity and community around what was once a solitary activity.

Overall, while some of history's greatest stories have been told in linear media, games and interactive media will be where the defining stories of the future are told—and we believe the most important narrative companies of the next century will be born there.

Interactive Video: The Fusion of Narrative and Gaming

Given the dominance of gaming in culture, we believe the next Pixar will emerge through a media format that combines narrative with gaming. One promising format we see is interactive video.

First, what is interactive video, and how does it differ from a video game? In a video game, developers pre-load a set of assets into the game engine. In "Super Mario Bros.," artists designed the Mario character, the trees, and the backgrounds, and programmers specified that Mario jumps 50 pixels when the player presses the "A" button. The jump frames are rendered through a traditional graphics pipeline. The result is a highly deterministic, precomputed game architecture in which developers have complete control.

In contrast, interactive video relies entirely on neural networks to generate frames in real time. No assets need to be created or loaded beyond a creative prompt (text or a representative image). A real-time AI image model receives player input (e.g., the "up" button) and probabilistically infers the next game frame.
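To make the contrast concrete, here is a toy sketch of the two loops in Python. Every name in it is a hypothetical stand-in—no real engine or model is implied—and it only illustrates the difference between deterministic, rule-based rendering and probabilistic frame sampling:

```python
import random

# Toy contrast between the two architectures described above.
# All names here are hypothetical stand-ins, not a real engine or model API.

def traditional_step(state, player_input):
    """Deterministic, asset-based: hand-authored rules over pre-made assets."""
    if player_input == "A":
        state["mario_y"] += 50                    # hard-coded rule: jump 50 pixels
    return f"rasterize sprites at y={state['mario_y']}"  # same input -> same frame

def generative_step(frame_history, player_input):
    """Probabilistic, fully generated: the next frame is sampled, not looked up."""
    # Stand-in for a neural model inferring the next frame from context + input.
    variants = [f"frame(action={player_input}, variant={i})" for i in range(3)]
    next_frame = random.choice(variants)          # same input may yield different frames
    frame_history.append(next_frame)
    return next_frame

state = {"mario_y": 0}
print(traditional_step(state, "A"))               # always identical output
print(generative_step([], "up"))                  # varies run to run
```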

The potential of interactive video lies in merging the accessibility of television and film with deep narrative and the dynamic, player-driven systems of video games. Everyone knows how to watch TV and follow a linear story; add real-time video generation driven by player input, and you get personalized, endless experiences—potentially allowing a single work to engage fans for thousands of hours. Blizzard's "World of Warcraft" has run for more than 20 years and still retains around 7 million subscribers today.

Interactive video also offers multiple consumption modes—viewers can easily enjoy content like watching a TV show or actively engage in gaming on mobile devices or controllers at other times. Allowing fans to experience their favorite IP universes in as many ways as possible is at the core of transmedia storytelling, which helps enhance intimacy with the IP.

Over the past decade, many storytellers have attempted to realize the vision of interactive video. An early breakthrough was Telltale's The Walking Dead—an interactive experience based on Robert Kirkman's comic series, where players watch animated scenes unfold but make choices at critical moments through dialogue and quick-time events. These choices—such as deciding which character to save during a zombie attack—create personalized story variations, making each gaming experience unique. "The Walking Dead" launched in 2012 and achieved tremendous success—winning multiple Game of the Year awards and selling over 28 million copies to date.

In 2017, Netflix also entered the interactive video space, producing animated titles such as "Cat Burglar" as well as the critically acclaimed "Black Mirror: Bandersnatch," a live-action film in which viewers make choices for a young programmer adapting a fantasy novel into a video game. "Bandersnatch" became a holiday phenomenon, attracting a fervent fanbase that created flowcharts documenting every possible ending.

However, despite positive reviews, both "Bandersnatch" and "The Walking Dead" faced an existential problem: the time and cost of manually authoring the countless story branches that define the format were prohibitive. As Telltale expanded to multiple projects, a crunch culture took hold, with developers complaining of "fatigue and burnout." Storytelling quality suffered—"The Walking Dead" debuted with a Metacritic score of 89, but "Batman," released four years later on one of Telltale's biggest licensed IPs, scored a disappointing 64. In 2018, Telltale shut down, having failed to find a sustainable business model.

For "Bandersnatch," the crew shot 250 video segments comprising over 5 hours of footage to cover the film's five endings. The budget and production time were reportedly double those of a standard "Black Mirror" episode, with producers saying the project's complexity was akin to "producing four episodes simultaneously." Ultimately, in 2024, Netflix shut down its entire interactive specials division and turned instead to traditional gaming.

Until now, the content cost of an interactive video project has scaled linearly with playtime—there has been no way around it. Advances in generative AI models, however, may be the key to making interactive video scale.

Generative Models Will Soon Be Fast Enough to Support Interactive Video

Recent advances in image-model distillation have been remarkable. The 2023 releases of latent consistency models and SDXL Turbo dramatically improved the speed and efficiency of image generation: high-resolution renders that previously required 20-30 denoising steps can now be produced in a single step, cutting costs by more than 30x. Generating video—a sequence of consistent images with small frame-to-frame variations—suddenly became highly feasible.
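As a concrete example, single-step generation with SDXL Turbo looks roughly like this using the Hugging Face diffusers library (a minimal sketch assuming a CUDA GPU; the prompt is illustrative):

```python
import torch
from diffusers import AutoPipelineForText2Image

# Load the distilled SDXL Turbo model; fp16 keeps memory and latency low.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Distillation makes a usable image possible in a single denoising step,
# versus the 20-30 steps typical of earlier diffusion pipelines.
image = pipe(
    prompt="a hand-drawn animated fox running through a forest",  # illustrative
    num_inference_steps=1,
    guidance_scale=0.0,   # SDXL Turbo is trained to run without classifier-free guidance
).images[0]
image.save("frame.png")
```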

Earlier this year, OpenAI garnered widespread attention by announcing Sora, a text-to-video model capable of generating videos up to 1 minute long while ensuring visual consistency. Not long after, Luma AI released a faster video model, Dream Machine, capable of generating 120 frames (about 5 seconds of video) in 120 seconds. Luma recently shared that they attracted an astonishing 10 million users in just 7 weeks. Last month, Hedra Labs released Character-1, a character-focused multimodal video model that can generate 60 seconds of video in 90 seconds, showcasing rich human emotions and voiceovers. Runway recently launched Gen-3 Turbo, a model that can render 10 seconds of footage in just 15 seconds.

Today, an aspiring filmmaker can quickly generate a few minutes of 720p HD video content from text prompts or reference images, and can pair it with starting or ending keyframes for added specificity. Runway has also developed a suite of editing tools that provide finer control over diffusion-generated videos, including in-frame camera control, frame interpolation, and motion brushes. Luma and Hedra will also soon launch their respective creator toolkits.

Although production workflows are still in their early stages, we have already encountered several content creators using these tools to tell stories. Resemblance AI created Nexus 1945, a compelling 3-minute alternate history story of World War II, produced using Luma, Midjourney, and Eleven Labs. Independent filmmaker Uncanny Harry collaborated with Hedra to create a cyberpunk short film, and creators have also produced music videos, trailers, travel vlogs, and even fast food burger ads. Since 2022, Runway has hosted an annual AI film festival that selects 10 outstanding AI-produced short films.

It is important to note the current limitations—there remains a significant gap in narrative quality and control between a 2-minute clip generated from prompts and a 2-hour feature produced by a professional team. Getting the desired content out of prompts or reference images is often difficult, and even experienced prompt engineers discard most of what they generate; AI creator Abel Art reports that one coherent minute of video requires roughly 500 generated clips. Image consistency also tends to break down after a minute or two of continuous footage and must be patched with manual editing, which is why most generated videos today top out at around 1 minute.

For most professional Hollywood studios, diffusion-generated video is useful for storyboarding in pre-production—visualizing scenes or characters—but cannot yet replace on-set shooting. There are also opportunities to use AI in post-production for audio and visual effects, but overall, AI creator toolkits remain early compared with traditional workflows that have benefited from decades of investment.

In the short term, one of the biggest opportunities for generative video lies in new media formats such as interactive video and short dramas. Interactive video is already naturally segmented into short 1-2 minute clips punctuated by player choices, and it is often animated or stylized, which tolerates lower-resolution assets. More importantly, producing these short clips with diffusion models is far cheaper than it was in the Telltale/Bandersnatch era—Abel Art estimates a 1-minute Luma video costs about 125 dollars, equivalent to the cost of renting a film crew for a day.

Although the quality of generated video today can be inconsistent, the popularity of vertical short-drama apps like ReelShort and DramaBox has proven the demand for low-production-value episodic shorts. Despite critics' complaints about amateur cinematography and formulaic scripts, ReelShort has driven over 30 million downloads and monthly revenue exceeding 10 million dollars, launching thousands of mini-series like "Forbidden Desire: Love of the Alpha."

The biggest technical hurdle facing interactive video is generating frames fast enough for real-time play. Dream Machine currently generates about 1 frame per second; the minimum acceptable target for modern gaming consoles is a stable 30 FPS, with 60 FPS as the gold standard. Techniques like PAB (Pyramid Attention Broadcast) can push certain kinds of video to 10-20 FPS, but that is still not fast enough.
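The size of that gap is easy to quantify from the numbers above (a back-of-envelope sketch; the PAB figure is an assumed midpoint of the reported range):

```python
# Back-of-envelope latency math using the figures cited above.
target_fps = 30                       # minimum bar for modern consoles
frame_budget_ms = 1000 / target_fps   # ~33 ms available to generate each frame

current_fps = 1                       # Dream Machine: roughly 1 frame per second
pab_fps = 15                          # assumed midpoint of the 10-20 FPS PAB range

print(f"Per-frame budget at {target_fps} FPS: {frame_budget_ms:.1f} ms")
print(f"Speedup still needed from 1 FPS: {target_fps / current_fps:.0f}x")
print(f"Speedup still needed from PAB:   {target_fps / pab_fps:.0f}x")
```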

Current State: The Landscape of Interactive Video

Considering the pace of improvements we see in underlying hardware and models, we estimate that we are about 2 years away from commercially viable fully generated interactive video.

Today, we see players like Microsoft Research and OpenAI making research progress on end-to-end foundation models for interactive video. Microsoft's model aims to generate fully "playable worlds" in 3D environments. OpenAI demonstrated a Sora demo performing "zero-shot" simulation of Minecraft: "Sora can simultaneously control the player in Minecraft while rendering the world and its dynamics in high fidelity."

In February 2024, Google DeepMind released its own end-to-end foundation model for interactive video, Genie. What sets Genie apart is its latent action model, which infers the latent action taken between each pair of video frames. Trained on 300,000 hours of platformer videos, Genie learned to recognize character actions such as jumping over obstacles. The latent action model and a video tokenizer both feed into a dynamics model that predicts the next frame, constructing interactive video step by step.
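A highly simplified sketch of how those three components could fit together is shown below. The module names, shapes, and internals are our own illustrative stand-ins, not DeepMind's implementation; the point is only the data flow: frames become tokens, actions are inferred (during training) or supplied by the player (during play), and a dynamics model rolls the world forward.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for Genie's three components (not DeepMind's code).

class VideoTokenizer(nn.Module):
    """Compresses each frame into a grid of latent tokens."""
    def __init__(self, dim=32):
        super().__init__()
        self.enc = nn.Conv2d(3, dim, kernel_size=8, stride=8)
    def forward(self, frame):                  # (B, 3, 64, 64)
        return self.enc(frame)                 # (B, dim, 8, 8)

class LatentActionModel(nn.Module):
    """Infers a discrete latent action from two consecutive frames' tokens."""
    def __init__(self, dim=32, num_actions=8):
        super().__init__()
        self.head = nn.Linear(dim * 2, num_actions)
    def forward(self, tok_t, tok_t1):
        pooled = torch.cat([tok_t.mean((2, 3)), tok_t1.mean((2, 3))], dim=-1)
        return self.head(pooled).argmax(-1)    # (B,) latent action ids

class DynamicsModel(nn.Module):
    """Predicts the next frame's tokens from current tokens plus an action."""
    def __init__(self, dim=32, num_actions=8):
        super().__init__()
        self.action_emb = nn.Embedding(num_actions, dim)
        self.mix = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
    def forward(self, tok_t, action):
        a = self.action_emb(action)[:, :, None, None]  # broadcast over token grid
        return self.mix(tok_t + a)                     # next-frame tokens

tokenizer, lam, dynamics = VideoTokenizer(), LatentActionModel(), DynamicsModel()

# Training time: the latent action model labels unlabeled video frame pairs.
tokens = tokenizer(torch.rand(1, 3, 64, 64))           # seed frame
inferred_action = lam(tokens, tokens)                  # toy example: same frame twice

# Play time: the player's input replaces the inferred latent action,
# and the dynamics model rolls the world forward frame by frame.
for player_action in [3, 1, 1]:                        # e.g. "jump", "right", "right"
    tokens = dynamics(tokens, torch.tensor([player_action]))
```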

At the application level, we have already seen some teams exploring new types of interactive video experiences. Many companies are working to create generative films or television shows, designing and developing around the current model limitations. We have also seen some teams incorporating video elements into AI-native game engines.

Ilumine's Latens is developing a "lucid dream simulator" in which users walk through a dream that is generated in real time as they move; the slight generation delay becomes part of the surreal experience. Developers from the open-source community Deforum are building immersive interactive video installations in the real world. Dynamic is developing a simulation engine in which users control robots from a first-person perspective using fully generated video.

In the television and film space, Fable Studio is developing Showrunner, an AI streaming service that allows fans to adapt their own versions of popular shows. Fable's proof-of-concept project, South Park AI, garnered 8 million views when it premiered last summer. Solo Twin and Uncanny Harry are two cutting-edge AI filmmaking studios. Alterverse has created an interactive video role-playing game inspired by D&D, where the community decides what happens next. Late Night Labs is a new top-tier film production company that integrates AI into the creative process. Odyssey is developing a visual storytelling platform powered by four generative models.

As the lines between film and gaming blur, we will see AI-native game engines and tools that empower creators with more control. Series AI has developed the Rho Engine, an end-to-end platform for AI game development, and is collaborating with major IP holders to develop original works. We are also seeing AI creator suites launched by Rosebud AI, Astrocade, and Videogame AI that enable newcomers in programming or art to quickly get started in creating interactive experiences.

These new AI creator suites will create market opportunities for storytelling, enabling a new class of citizen creators to bring their imaginations to life using prompt engineering, visual sketches, and voice recognition.

Who Will Build the Interactive Pixar?

Pixar was able to build an iconic company by riding a foundational technological shift: computers and 3D graphics. Today, a similar wave is occurring with generative AI. But it is worth remembering that Pixar's success stemmed largely from "Toy Story" and the classic animated films created by a world-class storytelling team led by John Lasseter. The best stories come from human creativity paired with new technology.

Similarly, we believe the next Pixar will need to be both a world-class interactive storytelling studio and a top-tier technology company. Given the pace of AI research, creative teams will need to work hand in hand with AI teams, merging narrative and game design with technological innovation. Pixar had a unique team that blended art and technology, plus a partnership with Disney. Today's opportunity belongs to a new team that can integrate the disciplines of gaming, film, and AI.

To be clear, this will be a massive challenge, and not only a technical one. The team will need to discover how human storytellers can collaborate with AI tools in ways that amplify rather than diminish their imaginations. There are also many legal and ethical hurdles: unless creators can prove ownership of all the data used to train their models, the legal ownership and copyright protection of AI-generated works remain unclear, and compensation for the original writers, artists, and producers behind the training data still needs to be resolved.

However, it is also clear today that there is a strong demand for new interactive experiences. In the long run, the next Pixar could not only create interactive stories but also build complete virtual worlds. We have previously explored the potential of endless games—dynamic worlds that merge real-time level generation, personalized narratives, and intelligent agents—similar to the concept of HBO's "Westworld." Interactive video addresses one of the biggest challenges in making "Westworld" a reality—rapidly generating vast amounts of personalized, high-quality interactive content.

One day, with the help of AI, we may begin the creative process by building a story world—a fully realized IP universe complete with characters, narrative threads, visuals, and more—and then generate whatever media products we wish to offer a given audience or context. This would be the ultimate evolution of transmedia storytelling, completely blurring the lines between traditional media formats.

Pixar, Disney, and Marvel have all been able to create unforgettable worlds that have become core parts of fan identities. The opportunity for the next interactive Pixar lies in leveraging generative AI to achieve the same goal—creating new story worlds that blur the boundaries of traditional narrative formats, thus creating unprecedented worlds.
