Experience Alibaba's HappyHorse 1.0 on Cuty.ai — the #1-ranked AI video model on the Artificial Analysis Video Arena. Generate native 1080p video with synchronized audio in a single forward pass, native lip-sync across seven languages, and cinematic quality from text or image prompts. Try it free!
Discover what makes HappyHorse 1.0 exceptional
HappyHorse 1.0 is built on a 40-layer unified single-stream Transformer that denoises text, image, video, and audio tokens together in one sequence — no separate Foley model, no post-processing pass. Speech, footfalls, and ambient sound emerge from the same step as the visuals, so dialogue and on-screen action align at the phoneme level.
Phoneme-level lip-sync ships natively for English, Mandarin, Cantonese, Japanese, Korean, German, and French. Mouth shapes are produced inside the same denoising step as the rest of the frame rather than bolted on by a face-region post-fitter, making HappyHorse 1.0 one of the few top-ranked video models with production-ready multilingual dialogue at launch.
HappyHorse 1.0 generates true 1080p, not upscaled, across 16:9, 9:16, 1:1, 4:3, and 3:4, so the same scene is composed correctly for cinematic, vertical, square, and portrait delivery. Clips run from 3 to 15 seconds, and a 5-second 1080p clip takes roughly 38 seconds of inference on a single NVIDIA H100 thanks to an 8-step DMD-2 distilled denoising path.
HappyHorse 1.0 took #1 in both Text-to-Video (Elo 1333) and Image-to-Video (Elo 1392) on the Artificial Analysis Video Arena — a blind human-preference benchmark — within days of its anonymous April 7, 2026 debut. The 60-point T2V Elo gap over the previous leader is the largest single-release leap on the leaderboard since it launched.
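To put the 60-point gap in perspective, here is a minimal sketch that converts an Elo difference into an expected head-to-head preference rate. It assumes the standard logistic Elo formula with a 400-point scale. The arena's exact rating method is not stated here, so treat the result as a rough guide. The function name `expected_win_rate` is illustrative.

```python
def expected_win_rate(elo_gap: float, scale: float = 400.0) -> float:
    """Expected share of pairwise votes won by the higher-rated model.

    Assumes the standard Elo logistic curve (an assumption, not the
    arena's documented method).
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))


# 60-point Text-to-Video gap quoted above.
t2v_gap_rate = expected_win_rate(60)
print(f"{t2v_gap_rate:.1%}")  # roughly 58.5%
```

Under that assumption, a 60-point lead means the higher-rated model would be preferred in roughly 58 to 59 of every 100 blind comparisons against the previous leader.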
Everything you need to know about HappyHorse 1.0
HappyHorse 1.0 is the first AI video model from Alibaba's Taotian Future Life Lab — a 15-billion-parameter unified Transformer that jointly generates video and synchronized audio from text or image prompts at native 1080p. After debuting anonymously on the Artificial Analysis Video Arena around April 7, 2026 and immediately taking #1 in both Text-to-Video and Image-to-Video, Alibaba publicly claimed authorship on April 10, 2026.
HappyHorse 1.0 was built inside the Future Life Lab at Alibaba's Taotian Group, part of the ATH (Alibaba Token Hub) AI Innovation Unit. The technical lead is Zhang Di — a fifteen-year industry veteran who served as Vice President at Kuaishou and was the technical architect of Kling AI before rejoining Alibaba in late 2025 to lead the lab.
Unlike most video models that bolt on audio as a separate post-step, HappyHorse 1.0 puts text, image, video, and audio tokens into a single token sequence and denoises them together in one 40-layer single-stream Transformer. Speech, sound effects, and ambient audio synchronize naturally to the visuals because they are produced in the same forward pass.
HappyHorse 1.0 ships native lip-sync in seven languages: English, Mandarin, Cantonese, Japanese, Korean, German, and French. Mouth shapes are aligned to phonemes in the same denoising step as the rest of the frame. Other languages still produce reasonable mouth movement, but accuracy at the phoneme level is below the supported set.
HappyHorse 1.0 generates native 1080p video (with 720p available) in clip lengths from 3 to 15 seconds. Aspect ratios include 16:9, 9:16, 1:1, 4:3, and 3:4 — covering cinematic widescreen, mobile vertical, square social, and portrait formats. The 8-step DMD-2 distillation pipeline takes around 38 seconds per 5-second 1080p clip on a single NVIDIA H100.
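The quoted speed figures can be turned into a few rough ratios. The sketch below uses only the numbers above (38 seconds, a 5-second clip, 8 steps). The per-step figure assumes the denoising steps account for the whole runtime, which ignores text encoding and decoding overhead, and the variable names are illustrative.

```python
INFERENCE_SECONDS = 38.0   # quoted time for one 5-second 1080p clip on an H100
CLIP_SECONDS = 5.0         # clip duration
DENOISE_STEPS = 8          # DMD-2 distilled step count

# Seconds of compute per second of finished video.
compute_per_video_second = INFERENCE_SECONDS / CLIP_SECONDS

# Upper bound on time per denoising step, assuming the 8 steps
# account for the whole 38 seconds (an assumption).
max_seconds_per_step = INFERENCE_SECONDS / DENOISE_STEPS

# 5-second clips per GPU-hour if generated back to back at that rate.
clips_per_gpu_hour = 3600.0 / INFERENCE_SECONDS

print(compute_per_video_second)        # 7.6
print(max_seconds_per_step)            # 4.75
print(round(clips_per_gpu_hour, 1))    # 94.7
```

These ratios say nothing about how longer clips or 720p output scale, since timing is quoted only for the 5-second 1080p case.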
HappyHorse 1.0 holds #1 in both Text-to-Video and Image-to-Video on the Artificial Analysis Video Arena, ahead of Kling, Veo, and Seedance under blind human-preference voting. It is also unique among top-tier models in combining joint video and audio generation in a single forward pass, native lip-sync across seven languages, and native 1080p cinematic output. When clips are compared with audio on, HappyHorse 1.0 currently ranks #2 by a small margin.
You can try HappyHorse 1.0 on Cuty.ai with our free trial credits — both text-to-video and image-to-video are live in the studio. For extensive use and access to all premium features, including longer clips and the Pro mode for hero shots and dialogue-heavy content, we offer various subscription plans.
Start generating amazing content with our powerful AI models. Try it free today!