Insights · 10 min read

The Evolution of AI Video Generation: From 2023 to Now

How AI video went from warped, glitching memes to cinematic production tools in three years — and what the next chapter looks like for creators.

Published April 24, 2026

In March 2023, someone on Reddit posted an AI-generated video of Will Smith eating spaghetti.

Smith's face warped between expressions that didn't quite belong to him. His hands turned into rubbery appendages. The noodles floated like they were obeying a different law of physics than the rest of the scene. The bowl teleported. The lighting drifted. The whole clip felt like a dream someone was trying — and failing — to remember.

That clip became a meme. Then it became a benchmark. Then, somehow, it became the timeline of an entire industry.

Three years later, Will Smith eating spaghetti is no longer a punchline. It's a yardstick — and the difference between the 2023 version and the 2026 version tells the whole story of where AI video has been, and where it's going.

A visual journey from warped 2023 AI clips to cinematic 2026 frames

Phase 1 — The Glitch Era (2022–early 2023)

Before Will Smith ever picked up a fork, AI video looked like static.

Early text-to-video systems, mostly research projects such as ModelScope's text-to-video model and CogVideo, could produce a few seconds of footage from a text prompt. The results were, charitably, surreal. Subjects melted into backgrounds. Faces had no internal consistency from frame to frame. Objects appeared and disappeared mid-shot, obeying a logic that wasn't quite physical and wasn't quite cinematic.

The technology was real. The output was unusable.

But something had been proven — text could become video. The pixels on screen had a relationship to the words in the prompt. That relationship was loose, chaotic, dreamlike. But it existed. And once it existed, the clock started ticking.

Phase 2 — The Spaghetti Era (2023)

The Will Smith clip didn't go viral because it was good. It went viral because it was specifically, hilariously, fascinatingly bad.

What made it perfect as a meme was also what made it perfect as a benchmark. Eating spaghetti requires every hard problem in AI video at once: a recognizable human face, consistent identity across motion, hands manipulating a tool, noodles obeying physics, a bowl staying put, lighting that doesn't drift. The 2023 version got all of it wrong, in interesting ways.

This was the moment AI video entered the public conversation. Not because anyone thought the technology was ready — but because everyone could now see, in one specific test, exactly how far it had to go.

The rest of 2023 was foundational rather than flashy. Runway's Gen-2 launched. Pika's first model arrived. Stability AI released Stable Video Diffusion. None of them solved the spaghetti problem. All of them established that AI video was now a real category — not a research demo.

Phase 3 — The Sora Shock (Early 2024)

In February 2024, OpenAI unveiled a research preview called Sora.

The clips ran up to a minute long. They had coherent characters. The lighting stayed consistent. The physics mostly worked. A woolly mammoth walked through snow. A drone shot floated over Tokyo at golden hour. A cat woke up its sleeping owner.

The internet stopped scrolling.

Whatever you thought AI video could become in 2030 — Sora collapsed that timeline into one demo reel. Suddenly the question wasn't "is this technology possible?" It was "how fast is it coming?"

The actual product wasn't released for months. The economics were brutal — generation costs ran into dollars per second. Most creators couldn't access it. But the proof of concept reset the industry's ambition overnight. Every other lab that was working on video — Google, Kuaishou, ByteDance, Alibaba, MiniMax, xAI — accelerated.

Phase 4 — The Multi-Player Race (Late 2024 – Early 2025)

The post-Sora wave was the moment AI video stopped being a single-product story.

Kuaishou's Kling 1.0 arrived from China, with motion quality that rivaled Sora's at a fraction of the cost. Runway shipped Gen-3 Alpha, then Gen-3 Alpha Turbo, focusing on professional editing workflows. Luma released Dream Machine. MiniMax launched Hailuo. Pika released its 1.5 series with object-level effects.

For the first time, creators had a real choice. Each model had personality. Each had specialties. Each had failure modes. And the clip lengths started growing — five seconds became ten, then fifteen.

The industry also had its first real reckoning with cost. Generating a 10-second clip on a premium model could cost a few dollars. Iteration was painful. Most creators rationed their generations like film stock.

Quality was racing. Affordability was lagging behind.

Phase 5 — The Audio Breakthrough (Mid–Late 2025)

The next leap wasn't in pixels. It was in sound.

For most of AI video's history, audio was a separate problem. You generated a clip silently. You added voice in post. You added music in post. You added foley in post. The handoffs were where production time got eaten alive.

Veo 3 and Veo 3.1 changed that. Native audio (dialogue, ambient sound, music, sound effects) was generated in the same pass as the video. Synchronized. Lip-synced. Environment-aware.

Kling 2.6 followed. Then Hailuo 2.3. Then Seedance.

By late 2025, native audio had shifted from differentiator to baseline expectation. If your model didn't generate audio, it wasn't competing in the top tier.

This was also when the Will Smith spaghetti benchmark started getting genuinely shocking results. Veo 3 produced a version where the face stayed stable, the noodles behaved like noodles, and the audio synced with the chewing. Was it perfect? No. But it was no longer a meme. It was just a video of someone eating spaghetti.

Phase 6 — The Multi-Shot Era (Early 2026)

Then came the breakthrough nobody had seen coming.

In February 2026, three models launched within weeks of each other — Kling 3.0, Sora 2 Pro, and Seedance 1.5 Pro. Each took a different angle on the same hard problem: making AI video work as storytelling, not just clip generation.

Kling 3.0 introduced multi-shot sequences — generating up to 180 seconds with a consistent character across different camera angles, all from a single prompt.

Seedance 2.0 followed shortly after with unified audio-video joint generation, twelve-file multimodal input, and phoneme-level lip sync in eight languages.

Veo 3.1 doubled down on cinematic quality, native 4K output, and 48kHz audio.

The Will Smith test was no longer a test. The latest version, generated with Kling 3.0, didn't just show Smith eating spaghetti — it showed him having a conversation with a child while eating spaghetti, with synced audio, stable identity, and physical objects that behaved like physical objects.

The spaghetti benchmark had been beaten. The new question was: what happens now that the model can act?

A visual representation of the model lineup defining 2026 — multiple cinematic frames in a flowing arrangement

Phase 7 — The Multi-Model Workflow (Now)

Here's where we are in April 2026.

There is no single best AI video model. There are five or six, each with a distinct strength, and the creators producing the best work in 2026 don't pick one — they route between several.

Seedance 2.0 for narrative and product fidelity
Kling 3.0 for long-form and human motion
Veo 3.1 for cinematic 4K hero shots
Hailuo 2.3 for stylized character work and rapid iteration
Wan 2.6 for budget volume work
Grok for loose, fast creative drafts
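
As a rough illustration of what "routing between several" can look like in practice, the list above can be written down as a lookup table that assigns each shot in a storyboard to a model. This is only a sketch: the shot categories, the `route_shots` helper, and the choice of Grok as the fallback are assumptions made for the example, and the model names are plain labels, not identifiers from any real API.

```python
from collections import defaultdict

# Hypothetical routing table derived from the list above. The keys are
# made-up shot categories; the values are display labels, not API model IDs.
ROUTES = {
    "narrative": "Seedance 2.0",
    "product": "Seedance 2.0",
    "long_form": "Kling 3.0",
    "human_motion": "Kling 3.0",
    "hero_4k": "Veo 3.1",
    "stylized_character": "Hailuo 2.3",
    "rapid_iteration": "Hailuo 2.3",
    "budget_volume": "Wan 2.6",
    "rough_draft": "Grok",
}


def route_shots(shots, default="Grok"):
    """Group (shot_name, shot_kind) pairs by the model each kind maps to.

    Unknown kinds fall back to `default` (an assumption for this sketch).
    """
    plan = defaultdict(list)
    for name, kind in shots:
        plan[ROUTES.get(kind, default)].append(name)
    return dict(plan)


# Example storyboard: each shot is tagged with one of the categories above.
storyboard = [
    ("opening wide", "hero_4k"),
    ("dialogue at table", "narrative"),
    ("walk through market", "human_motion"),
    ("product close-up", "product"),
]
plan = route_shots(storyboard)
```

Here `plan` groups the four shots under three models, so each batch can be generated with the tool that suits it. In a real workflow the table would change as models change, which is why it is kept as data rather than hard-coded logic.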

The frontier didn't consolidate. It diversified.

This is what makes 2026 structurally different from any previous era of AI video. In 2023, you waited for any model to get good. In 2024 and 2025, you picked a winner and learned its quirks. Now, the workflow is the model. The skill isn't mastering one tool. It's knowing which tool gets which shot.

What Actually Got Solved

Looking back at the three-year arc, four hard problems went from impossible to largely solved:

Identity Consistency

In 2023, a face couldn't survive five seconds of motion. In 2026, characters stay recognizably themselves across multiple shots, multiple camera angles, and multiple model generations.

Physical Coherence

In 2023, gravity was a suggestion. In 2026, objects fall, water splashes, fabric drapes, and limbs move within the constraints of bodies.

Native Audio

In 2023, AI video was silent. In 2026, top-tier models generate dialogue, sound effects, music, and ambient audio in the same pass as the visual — synced, lip-matched, environmentally aware.

Multi-Shot Storytelling

In 2023, you got one shot. In 2026, you can describe a sequence and get back a multi-scene narrative with consistent characters, lighting, and tone.

These weren't predictable wins. Each one required architectural advances: better diffusion models, joint multimodal training, attention mechanisms that preserve temporal coherence. That all four landed within three years is historically unusual.

What Hasn't Been Solved Yet

For balance — and because every honest history acknowledges what's still ahead — there are still hard problems.

True long-form coherence. Most models still degrade past 30 seconds. The 180-second sequences in Kling 3.0 are an exception, and even those drift if you push them.

Fine-grained directing. You can describe a scene. You can't yet say "the actor should pause for a half-beat after the second word, then look down." The director-level control that filmmakers want isn't quite here.

Editing existing footage. Generating new clips is solved. Modifying real clips — text-instructed video editing, scene-level changes to existing footage — is still developing.

Cost at scale. Cinematic-grade clips still cost meaningful credits. Bulk production for full-length content remains expensive.

These are the open frontiers for 2026 and 2027. They're also the places the next wave of breakthroughs will land.

What This Means for Creators

The lesson of the three-year arc isn't "the technology is finally ready." That framing misses the point.

The lesson is that the technology is now changing fast enough that the skill that matters isn't committing to a single tool. It's learning quickly. The creator who picked up Sora in 2024, then routed to Kling and Hailuo in 2025, then to Seedance and Veo 3.1 in 2026, kept their edge by adapting. The creator who tried to master one model and stick with it watched the frontier move past them every six months.

The platforms that win in 2026 aren't the ones with the best single model. They're the ones that let creators switch without friction — same canvas, same credits, same workflow, new model.

That's the structural insight Gendia was built around.

You can't predict which model will be the best in October 2026. You can predict that it'll be different from April. The platform that gives you all of them, on day one, in one tab — that's the one that doesn't require you to bet on a single horse.

Gendia's unified canvas with multiple AI video models accessible at once

Final Thoughts

Will Smith ate spaghetti in 2023, and the internet laughed.

Three years later, he ate spaghetti again, and the internet stopped being able to tell whether the clip was real.

That arc — from meme to mirror — happened in 36 months. It will not slow down. The next 36 months will compress just as much progress, in directions we can't fully see yet.

The right response isn't to predict where it lands. It's to be on the platform that adapts as it lands.

Don't watch the next chapter from the sidelines.

