How to Use HappyHorse 1.0: A Practical Guide with 12 Hands-On Examples
HappyHorse 1.0 is an open-source AI video model, ranked #1 on the Artificial Analysis Video Arena at the time of writing. Alibaba's Taotian Future Life Lab released it on April 9, 2026, and it became available through fal, the official API partner, on April 27. This guide treats it as the default choice for new short-form video workflows: joint video and audio in one forward pass, native 1080p output, multilingual lip-sync in seven languages, and full commercial-use rights for self-hosted deployments.
This guide is a practical, prompt-first walkthrough: how to phrase the prompt, what to ask for, and twelve examples that all follow the same recipe. Copy them, swap the subject for your own, and ship.
The Prompt Recipe That Actually Works
HappyHorse 1.0 rewards detail. Unlike most current text-to-video models that quietly sand off specifics when prompts get long, HappyHorse holds together for dense prompts as long as the detail is concrete. The trick is to make the prompt do real structural work, not pile on style tokens.
Every example below follows the same six-element recipe — in this order:
- Scene and timing — where and when ("a sunlit Tokyo cafe in late morning", "a rocky coastal beach at golden hour, eight seconds").
- Subject — who or what is in the frame, including scale, pose, gaze, and action ("a man in his early 30s in a charcoal grey wool coat, ordering at the counter").
- Action and motion — what moves and how much ("slowly turns her head toward the window, a soft smile begins to form, no more than 5% camera push-in").
- Camera language — movement, angle, shot size ("slow low-angle orbit, full 180-degree arc, eye-height roughly 30 cm above the ground").
- Lighting and texture — direction, quality, time of day ("soft window light from camera-left, shallow depth of field, 50mm lens").
- Audio intent — voice-over, dialogue, foley, ambient bed ("realistic ocean wave bed, a single distant gull, no music"). For dialogue, place the literal line in quotes.
Two extra rules that matter on HappyHorse 1.0 specifically: place spoken dialogue in quotes inside the prompt to engage the lip-sync pathway, and explicitly name the audio bed (foley, ambience, music intent) — the model handles audio better when you tell it what to make instead of leaving it to infer from the visuals. Generic style tokens like "cinematic, masterpiece, 4K, ultra-detailed" mostly get ignored. Spend that prompt budget on motion and audio instead.
Example 1 — Talking Head with English Lip-Sync
Talking heads are the most identity-sensitive category in video generation. The trick with HappyHorse 1.0 is to engage the joint audio-video pathway by placing the spoken line in quotes — the model treats quoted text as the dialogue track and aligns mouth shapes to the phonemes inside the same denoising step. Use Pro mode and a 16:9 frame for clean lip-sync.
Prompt
A medium close-up of a man in his early 30s, short dark hair, light grey crewneck sweater, sitting in a sunlit home office. He looks directly at the camera and says calmly in clear English: "I think the simplest version of the idea is also the strongest." Soft diffused window light from camera-left, a slightly out-of-focus bookshelf in the background. Eye-level shot, 50mm lens look, shallow depth of field. Subtle natural ambient room tone, no music. 1080p, 16:9, five seconds.
Why this works: the prompt names the lens look (50mm, shallow depth of field), the lighting direction (camera-left, soft diffused), the literal dialogue in quotes, and an explicit audio intent ("subtle natural ambient room tone, no music"). Those four levers together are what produce production-quality lip-sync on the first try.
Example 2 — Multilingual Talking Head (Japanese)
HappyHorse 1.0 ships native lip-sync support in seven languages: English, Mandarin, Cantonese, Japanese, Korean, German, and French. The prompting recipe stays the same — place the literal line in quotes, name the language explicitly, and let the model align mouth shapes to the phonemes of that specific language. For Japanese, mora boundaries differ from English syllable timing, and naming the language up front nudges the model into the correct phoneme set.
Prompt
A medium close-up of a young Japanese woman in her late 20s, sitting at a small wooden table in a sunlit Tokyo cafe. She looks toward the camera and says clearly in Japanese: "今日は晴れていて、気持ちがいいですね。" Soft window light from camera-left, a single ceramic cup on the table, quiet espresso machine and distant chatter in the background. Eye-level shot, 50mm lens, shallow depth of field. 1080p, 16:9, five seconds.
Tip: when prompting in a non-English language, write the dialogue line natively rather than romanizing it. For Japanese, Mandarin, Korean, and Cantonese, use the native script in quotes. The lip-sync pathway is trained on the actual phonemes, not on transliterated approximations.
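The quoting and language-naming rules for dialogue can be enforced before a prompt is sent. The sketch below is an assumption-laden helper, not an official validator: `dialogue_clause` is a hypothetical name, and the "pure ASCII means romanized" check is only a rough heuristic for the four languages the tip singles out.

```python
# The seven lip-sync languages listed in this guide.
SUPPORTED_LANGUAGES = {"English", "Mandarin", "Cantonese", "Japanese",
                       "Korean", "German", "French"}
# Languages where the guide recommends native script over romanization.
NATIVE_SCRIPT_LANGUAGES = {"Mandarin", "Cantonese", "Japanese", "Korean"}


def dialogue_clause(speaker: str, language: str, line: str) -> str:
    """Build a dialogue clause that names the language and quotes the literal line."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"lip-sync language not listed in the guide: {language}")
    # Heuristic: a line made only of ASCII characters is probably romanized.
    if language in NATIVE_SCRIPT_LANGUAGES and line.isascii():
        raise ValueError(f"write the {language} line in native script, not romanized")
    return f'{speaker} says clearly in {language}: "{line}"'


print(dialogue_clause("She", "Japanese", "今日は晴れていて、気持ちがいいですね。"))
```

The returned clause drops straight into the subject or action part of a prompt, so the quoted line and the language name always travel together.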
Example 3 — Image-to-Video Animation
Image-to-video is HappyHorse 1.0's strongest arena category — Elo 1392, the highest score on the leaderboard at the time of writing. The pattern that works reliably: state explicitly what should move, how much it should move, and what should stay stable. Avoid asking for camera movement larger than ~5% on a still — small motion respects the source composition; large motion tends to break it.
Prompt
Animate this image: the woman gently turns her head toward the window and a soft, almost imperceptible smile begins to form. A few strands of hair shift in a light breeze. Slow push-in from the camera, no more than 5%. Match the lighting, color temperature, and depth of field of the original photograph exactly. Add quiet ambient room tone — a single distant bird call, no music. 1080p, 16:9, five seconds.
Example 4 — Vertical 9:16 Short-Form for Social
HappyHorse 1.0 supports 9:16 vertical natively: the model is not generating 16:9 and cropping; it is composing for vertical viewing. For TikTok, Reels, and YouTube Shorts, request 9:16 explicitly and ask for the subject to sit in the lower two-thirds of the frame so the upper third is left clear for caption overlays. Std mode is plenty for short-form; Pro mode rarely shows a meaningful quality lift for 9:16 thumb-zone content.
Prompt
A vertical 9:16 cooking shorts clip. A pair of hands cracks two eggs into a hot cast-iron pan over a gas burner, the egg whites sizzle on contact. Camera is locked off in a slight overhead angle, eggs in the lower two-thirds, room for a caption overlay in the upper third. Warm kitchen light, slight steam rising. Foley: realistic egg-crack and a clear sizzling pan, no music. 1080p, 9:16, six seconds.
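When captions are added in an editor afterwards, it helps to know where the reserved upper third falls in pixels. The arithmetic below assumes a 1080p vertical frame means 1080 × 1920 pixels, which is the common convention but is not stated by the model's documentation; `caption_safe_zone` is a hypothetical helper.

```python
def caption_safe_zone(width: int = 1080, height: int = 1920) -> dict:
    """Split a vertical frame into a caption band (top third) and a subject band (rest).

    Bands are (top_row, bottom_row) pixel ranges measured from the top of the frame.
    """
    if height <= width:
        raise ValueError("expected a vertical frame (height > width)")
    third = height // 3
    return {"caption": (0, third), "subject": (third, height)}


zones = caption_safe_zone()
print(zones)  # {'caption': (0, 640), 'subject': (640, 1920)}
```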
Example 5 — Cinematic Product Spot with Synchronized Audio
Joint audio + video is HappyHorse 1.0's signature capability. For brand spots, the recipe is to define the visual scene first, then specify three audio elements: voice-over (in quotes), foley (specific named sounds), and ambient bed. The single forward pass produces the audio aligned to the visual events without a separate sound design session.
Prompt
A premium skincare brand spot. Open on a clean white serum bottle with a gold dropper cap resting on a marble surface, soft golden-hour light from camera-left, dried botanicals scattered around the bottle. Slow push-in from a medium shot to a tight close-up on the dropper. A single soft female voice-over in English: "Refined by nature." A subtle ambient piano underneath, a soft ceramic tap as the dropper cap is lifted, no other dialogue. 1080p, 16:9, eight seconds.
Example 6 — Explicit Camera Movement Control
HappyHorse 1.0 follows explicit camera-language tokens. The cookbook of moves that work reliably: push-in, pull-back, dolly, tracking shot, orbit, handheld follow, locked-off, slow pan, tilt up, tilt down, crane up, crane down. Combine a movement with an angle (low-angle, eye-level, overhead, Dutch) and a shot size (wide, medium, close-up, extreme close-up) to nail the cinematographic register.
Prompt
A static subject: a vintage red motorcycle parked on wet asphalt at night, a single overhead street lamp casting a hard light pool around it. The camera performs a slow, smooth, low-angle orbit around the motorcycle, full 180-degree arc from frame-right to frame-left, eye height roughly 30 cm above the ground. Reflections in the wet asphalt track the orbit consistently. Audio: distant city ambience, a single soft engine tick, no music. 1080p, 16:9, eight seconds.
Tip: when you ask for an orbit or a long tracking shot, also specify the start and end positions ("from frame-right to frame-left, full 180-degree arc"). HappyHorse 1.0 follows trajectories more reliably when both endpoints are explicit instead of being left for the model to infer.
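The movement, angle, and shot-size vocabulary above combines mechanically, so a small builder can keep camera clauses consistent and insist on explicit endpoints for path-based moves. This is a sketch under stated assumptions: `camera_clause` is a hypothetical helper, and treating only orbits and tracking shots as moves that need endpoints is my reading of the tip, not a documented rule.

```python
MOVEMENTS = {"push-in", "pull-back", "dolly", "tracking shot", "orbit",
             "handheld follow", "locked-off", "slow pan", "tilt up",
             "tilt down", "crane up", "crane down"}
ANGLES = {"low-angle", "eye-level", "overhead", "Dutch"}
SHOT_SIZES = {"wide", "medium", "close-up", "extreme close-up"}
# Assumption: these are the moves that trace a path and so need both endpoints.
TRAJECTORY_MOVES = {"orbit", "tracking shot"}


def camera_clause(movement, angle, shot_size, start=None, end=None) -> str:
    """Combine one movement, one angle, and one shot size into a camera clause."""
    checks = ((movement, MOVEMENTS, "movement"),
              (angle, ANGLES, "angle"),
              (shot_size, SHOT_SIZES, "shot size"))
    for value, allowed, label in checks:
        if value not in allowed:
            raise ValueError(f"unknown {label}: {value!r}")
    if movement in TRAJECTORY_MOVES and not (start and end):
        raise ValueError(f"{movement} needs explicit start and end positions")
    clause = f"{angle} {shot_size} shot, {movement}"
    if start and end:
        clause += f" from {start} to {end}"
    return clause


print(camera_clause("orbit", "low-angle", "wide", "frame-right", "frame-left"))
```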
Example 7 — Multi-Shot Sequence with Character Consistency
The fal text-to-video endpoint supports a multi-shot mode: up to five shots, each up to twelve seconds, individually prompted, generated in one call with character identity preserved across shots. The pattern that works: describe the protagonist once at the top, then list each shot as a numbered beat with its own framing, action, and audio. Keep total duration at or under sixty seconds, the ceiling implied by five twelve-second shots.
Prompt
A four-shot coffee shop sequence. Same character throughout: a tall man in his mid-30s, dark hair, a charcoal grey wool coat over a cream sweater. Shot 1 (3 s): wide establishing shot, he enters a small bright coffee shop on a rainy morning, water on the windows. Shot 2 (3 s): medium shot at the counter, he orders, light steam from an espresso machine. Shot 3 (3 s): over-the-shoulder of the barista pouring milk into a small cup, latte art forming. Shot 4 (3 s): tight beauty close-up on the finished cup placed on the counter, his hand entering the frame to pick it up. Keep the man's face, hair, coat, and sweater identical across all four shots. 1080p, 16:9, twelve seconds total.
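The shot limits and the numbered-beat layout are easy to check in code before spending credits. The following is a minimal sketch that only assembles prompt text; `Shot` and `build_multishot_prompt` are hypothetical names, and the limits are the ones quoted in this guide rather than values read from the API.

```python
from dataclasses import dataclass
from typing import List

MAX_SHOTS = 5          # per this guide
MAX_SHOT_SECONDS = 12  # per this guide


@dataclass
class Shot:
    seconds: int
    description: str


def build_multishot_prompt(character: str, shots: List[Shot]) -> str:
    """Describe the protagonist once, then list each shot as a numbered, timed beat."""
    if not 1 <= len(shots) <= MAX_SHOTS:
        raise ValueError(f"need 1 to {MAX_SHOTS} shots, got {len(shots)}")
    for i, shot in enumerate(shots, start=1):
        if not 0 < shot.seconds <= MAX_SHOT_SECONDS:
            raise ValueError(f"shot {i} must be 1 to {MAX_SHOT_SECONDS} seconds")
    total = sum(shot.seconds for shot in shots)
    beats = " ".join(
        f"Shot {i} ({shot.seconds} s): {shot.description.rstrip('.')}."
        for i, shot in enumerate(shots, start=1)
    )
    return (f"Same character throughout: {character.rstrip('.')}. "
            f"{beats} {total} seconds total.")


sequence = [
    Shot(3, "wide establishing shot, he enters a small bright coffee shop"),
    Shot(3, "medium shot at the counter, he orders"),
    Shot(3, "over-the-shoulder of the barista pouring milk"),
    Shot(3, "tight close-up on the finished cup on the counter"),
]
print(build_multishot_prompt("a tall man in his mid-30s in a charcoal grey wool coat", sequence))
```

Computing the total from the per-shot durations avoids the easy mistake of stating a clip length that does not match the beats.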
Example 8 — Reference-to-Video for Brand Product Placement
The reference-to-video endpoint accepts a text prompt plus reference images. The intent is to keep each reference distinct (a real product, a specific character, a style frame) rather than blending them together. For brand work, this is the cleanest way to drop a real product into a generated scene without the typical drift on logo placement, label color, or shape. Each task accepts up to three elements, with up to four input images per element.
Prompt
Reference image 1: a specific dark-roast coffee bag with a clean kraft-paper finish and a visible brand mark. Generate a scene where the same bag stands on a wooden kitchen counter at sunrise, golden light spilling through a window from camera-left, soft steam rising from a mug placed beside it. Slow push-in from a medium shot to a tight close-up on the brand mark. Keep the bag's shape, color, and brand mark exactly as in the reference. Audio: soft morning ambience, a single quiet milk-pour sound. 1080p, 16:9, six seconds.
Example 9 — Compositing with @element Tokens
In the fal API, the `happyhorse_elements` field lets you define named assets — each with a name, a description, and 2–4 input images — and reference them in the prompt as `@element_name`. This is HappyHorse 1.0's explicit composition layer: it lets you pin a specific dog, a specific person, or a specific product into the scene with no naming ambiguity. Maximum three elements per task.
Prompt
Elements: @element_dog: a specific golden retriever (4 reference images of the same dog from different angles). @element_ball: a small bright orange tennis ball (2 reference images). Scene prompt: A sunlit park lawn at golden hour, soft warm side light, slight wind in the grass. @element_dog runs from frame-right toward the camera and catches @element_ball mid-air, slows down, and trots back the way he came. Handheld follow, eye-level, slight motion blur on the legs. Audio: light wind, a soft "thunk" as the ball is caught, distant park ambience, no music. 1080p, 16:9, six seconds.
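A quick pre-flight check can catch element mistakes, such as too many elements, too few reference images, or an `@token` with no matching definition. This sketch only builds and validates a dictionary and makes no network call. The `happyhorse_elements` field name comes from the text above; the per-element keys (`name`, `description`, `image_urls`), the convention that a name is the token without its `@`, and the example.com URLs are all assumptions for illustration.

```python
import re

MAX_ELEMENTS = 3           # per this guide
MIN_IMAGES, MAX_IMAGES = 2, 4  # per this guide


def build_elements_payload(prompt: str, elements: list) -> dict:
    """Validate element definitions against the prompt's @tokens and return a payload dict."""
    if len(elements) > MAX_ELEMENTS:
        raise ValueError(f"at most {MAX_ELEMENTS} elements per task")
    names = set()
    for element in elements:
        count = len(element["image_urls"])
        if not MIN_IMAGES <= count <= MAX_IMAGES:
            raise ValueError(f"{element['name']}: need {MIN_IMAGES} to {MAX_IMAGES} images, got {count}")
        names.add(element["name"])
    # Every @token used in the prompt must have a matching element definition.
    undefined = set(re.findall(r"@(\w+)", prompt)) - names
    if undefined:
        raise ValueError(f"undefined elements in prompt: {sorted(undefined)}")
    return {"prompt": prompt, "happyhorse_elements": elements}


elements = [
    {"name": "element_dog", "description": "a specific golden retriever",
     "image_urls": [f"https://example.com/dog{i}.jpg" for i in range(1, 5)]},
    {"name": "element_ball", "description": "a small bright orange tennis ball",
     "image_urls": ["https://example.com/ball1.jpg", "https://example.com/ball2.jpg"]},
]
payload = build_elements_payload(
    "@element_dog runs toward the camera and catches @element_ball mid-air.", elements
)
print(sorted(payload))
```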
Example 10 — Natural-Language Video Edit
The fal video-edit endpoint takes an input clip and a text instruction and applies targeted edits. The pattern that works best: state explicitly what to change AND what to preserve. Use phrasing like "change only X" + "keep everything else the same" + repeat the preserve list. For audio edits, also state whether the audio bed should be preserved or regenerated. Up to five reference images can be passed for content guidance.
Prompt
Take this input clip of a runner on a coastal road and change the time of day from flat midday light to warm golden-hour light just before sunset. Add a long subject shadow falling toward the lower-left corner. Keep the runner, their pace, the framing, the camera move, and all ambient sound exactly the same. Do not change the clothing colors, the road, or the ocean horizon. 1080p, 16:9, six seconds.
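The "change only X, keep everything else" pattern is regular enough to template. Below is a minimal sketch that produces the instruction text only; `build_edit_instruction` is a hypothetical helper, and the exact wording of the audio sentences is my own phrasing of the guidance above, not a required syntax.

```python
def build_edit_instruction(change: str, preserve: list, keep_audio: bool = True) -> str:
    """State the single change, repeat the preserve list, and say what happens to audio."""
    if not preserve:
        raise ValueError("list at least one thing to preserve")
    keep = ", ".join(preserve)
    audio = ("Keep the original audio bed unchanged."
             if keep_audio else
             "Regenerate the audio bed to match the edit.")
    return f"Change only {change}. Keep everything else the same: {keep}. {audio}"


instruction = build_edit_instruction(
    "the time of day, from flat midday light to warm golden-hour light",
    ["the runner", "their pace", "the framing", "the camera move"],
)
print(instruction)
```

Forcing a non-empty preserve list is the point of the helper: an edit instruction that names only the change leaves the model free to alter everything else.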
Example 11 — Stylized Image-to-Video Transform
Stylized transforms keep the subject and composition of an input image but change the surface medium — for example, animating a photograph in a watercolor painterly style or pushing a flat illustration into anime cel-shaded motion. Drop in the source, then describe what must stay consistent (subject, pose, composition) and what must change (medium, palette, brushwork). Add a hard "no extra elements" constraint to prevent the model from inventing peripheral details.
Prompt
Use this flat-color illustration as the input. Re-render it as a loose, warm-toned watercolor painting in motion. Preserve the original character's pose, proportions, and composition exactly. Add subtle wet-on-wet wash bleeds, visible paper grain, and gentle pigment shifts in shadow areas. Slow camera drift from left to right, no zoom. No dialogue, soft natural ambient sound only. 1080p, 16:9, five seconds.
Example 12 — Bilingual Prompts (English and Mandarin)
HappyHorse 1.0 was trained by a team operating in both Chinese and English, so prompt comprehension is genuinely bilingual. Mandarin prompts produce results comparable in quality to English prompts. The implication for production: write prompts in whichever language better captures the cultural specificity of the scene. For Beijing hutong shots, Cantonese street markets, or Japanese izakaya scenes, native-language prompts often produce more physically accurate detail than translated equivalents.
Prompt
一条传统的北京胡同清晨场景。一位老人骑着一辆旧二八自行车,缓缓从画面右侧驶向左侧。两侧是灰砖灰瓦的四合院围墙,几只鸽子从屋顶飞过。柔和的清晨阳光从画面左上方洒下,地面上有淡淡的晨雾。镜头位置低角度,缓慢横移跟随老人。声音:远处的鸽哨声、自行车链条的轻响、几声晨练的吆喝。1080p,16:9,八秒。
For bilingual workflows, keep the language consistent across all six prompt elements. Mixing English camera language with Mandarin scene description occasionally produces interpretation drift on the camera move. Pick one language per prompt and stay with it.
Choosing Mode, Resolution, and Aspect Ratio by Use Case
HappyHorse 1.0 exposes two quality modes — Std (the distilled 8-step student) and Pro (the extended denoising schedule) — and supports clip lengths from 3 to 15 seconds in five aspect ratios (16:9, 9:16, 1:1, 4:3, 3:4). Std is the right default for short-form, ideation, and any clip that will go through downstream review. Reach for Pro when fidelity is the bottleneck — typically dialogue scenes, brand hero shots, and multi-shot sequences. Audio is on by default and adds about 50% to credit usage.
| Workflow | Recommended Resolution | Recommended Mode | Aspect | Notes |
|---|---|---|---|---|
| Short-form social draft | 720p | Std | 9:16 | Fastest. Audio off for the cheapest credits. |
| English / multilingual talking head | 1080p | Pro | 16:9 | Pro gives the cleanest lip-sync. |
| Cinematic product spot | 1080p | Pro | 16:9 | Joint audio is the headline value here. |
| Image-to-video animation | 1080p | Pro | Match input | I2V is the model's #1 arena category. |
| Vertical TikTok / Reels | 1080p | Std | 9:16 | Std is plenty for thumb-zone content. |
| Multi-shot scene (up to 5 shots) | 1080p | Pro | 16:9 | Character consistency benefits from Pro. |
| Reference-to-video product placement | 1080p | Pro | 16:9 or 9:16 | Brand-mark accuracy benefits from Pro. |
| Stylized I2V transform | 1080p | Pro | Match input | Painterly motion details need Pro. |
| Natural-language video edit | Match input | Pro | Match input | Edits preserve input fidelity automatically. |
| Real-world foley / ambient bed | 720p–1080p | Std | 16:9 | Audio bed quality is similar across modes. |
| Element compositing (@elements) | 1080p | Pro | 16:9 | Pro recovers fine detail on named assets. |
| Internal review / batch ideation | 720p | Std | 16:9 | Cheapest path; audio off. |
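A few rows of the table can be turned into a lookup that also applies the audio surcharge. This is a rough sketch: the workflow keys and `plan` function are hypothetical, `base_credits` is whatever an audio-off generation costs on your plan (the guide gives no absolute prices), and the 50% figure is the approximate number quoted above.

```python
# (resolution, mode, aspect ratio) for a subset of the table's rows.
RECOMMENDED = {
    "short-form social draft": ("720p", "Std", "9:16"),
    "talking head": ("1080p", "Pro", "16:9"),
    "cinematic product spot": ("1080p", "Pro", "16:9"),
    "vertical tiktok / reels": ("1080p", "Std", "9:16"),
    "internal review": ("720p", "Std", "16:9"),
}
AUDIO_SURCHARGE = 0.5  # "about 50%" per this guide, so treat results as estimates


def plan(workflow: str, base_credits: float, audio: bool = True) -> dict:
    """Look up recommended settings and estimate credits with or without audio."""
    resolution, mode, aspect = RECOMMENDED[workflow.lower()]
    credits = base_credits * (1 + AUDIO_SURCHARGE) if audio else base_credits
    return {"resolution": resolution, "mode": mode,
            "aspect_ratio": aspect, "audio": audio,
            "estimated_credits": credits}


print(plan("Talking head", base_credits=10))
print(plan("Internal review", base_credits=10, audio=False))
```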
Common Pitfalls and How to Avoid Them
- Generic style boosters ("8K, ultra-detailed, cinematic, masterpiece") are mostly ignored. They are leftover patterns from earlier diffusion models. Spend that prompt budget on motion, audio, and constraints instead.
- Forgetting to quote dialogue. Without quotes, the model paraphrases the line and lip-sync degrades. With quotes, the line is treated as a fixed dialogue track and aligned to phonemes inside the same denoising step.
- Asking for too much camera motion on a still input. Image-to-video works best with small, named motion ("slow push-in, no more than 5%"). Large camera moves on a still tend to break the source composition.
- Skipping the audio intent. HappyHorse 1.0 generates audio by default — if you do not name the bed, the model picks one. State the audio explicitly: "no music, soft ambient room tone" or "subtle ambient piano underneath, no other dialogue."
- Crowding multi-shot mode. The fal multi-shot endpoint caps at five shots, twelve seconds each. More than that and the model treats it as one continuous clip. For longer narratives, chain calls in a downstream pipeline.
- Mixing prompt languages mid-prompt. Pick English or Mandarin and stay with it across all six elements. Mixing languages occasionally produces interpretation drift on the camera move.
- Running on consumer GPUs without quantization. The full 15B model needs ≥48 GB VRAM to run without quality loss. RTX 4090 deployments require 4-bit quantization, which visibly degrades motion stability.
- Trusting "official" mirror domains. The team has publicly warned that several look-alike Happy Horse domains are phishing attempts. Pin to the GitHub repository, the official Hugging Face hub, or vetted API partners like fal.
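Three of the pitfalls above (style boosters, unquoted dialogue, missing audio intent) can be caught with a simple text check before generation. The linter below is a heuristic sketch, not an official tool: `lint_prompt`, the keyword lists, and the speech-verb pattern are all assumptions, and it will produce false positives, for example when "cinematic" is used in a legitimate sense.

```python
import re

STYLE_BOOSTERS = ("8k", "4k", "ultra-detailed", "cinematic", "masterpiece")
AUDIO_HINTS = ("audio", "foley", "ambient", "ambience", "music",
               "voice-over", "room tone")
SPEECH_VERB = re.compile(r"\b(says|said|asks|replies|speaks)\b", re.IGNORECASE)
# Straight quotes, curly quotes, or CJK corner brackets around a non-empty line.
QUOTED_LINE = re.compile(r'["“「][^"”」]+["”」]')


def lint_prompt(prompt: str) -> list:
    """Return a list of warnings for three common prompt pitfalls."""
    warnings = []
    lowered = prompt.lower()
    boosters = [t for t in STYLE_BOOSTERS
                if re.search(rf"\b{re.escape(t)}\b", lowered)]
    if boosters:
        warnings.append("generic style boosters: " + ", ".join(boosters))
    if SPEECH_VERB.search(prompt) and not QUOTED_LINE.search(prompt):
        warnings.append("speech verb without a quoted dialogue line")
    if not any(hint in lowered for hint in AUDIO_HINTS):
        warnings.append("no explicit audio intent")
    return warnings


print(lint_prompt("A man says hello to the camera, cinematic, masterpiece, 4K."))
```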
Wrap-Up: A Default Prompt Template
If you take one thing from this guide, take the prompt template. It works for almost every use case in the examples above:
Scene and timing → Subject (with scale and gaze) → Action and motion → Camera language (movement, angle, shot size) → Lighting and texture (direction, quality) → Audio intent (voice-over in quotes, foley, ambient bed) → Resolution, aspect ratio, duration.
Start at Std mode, 720p, 9:16 for fast ideation. Run two generations to calibrate the prompt, then move to Pro mode and 1080p for the final asset. For refinements, edit the existing clip with a natural-language instruction rather than regenerating from scratch, because regenerating is the single biggest source of drift in production work.