HappyHorse 1.0 Prompting Guide: 12 Techniques for Joint Audio + Video

HappyHorse 1.0 launched on April 9, 2026 as the first open-source AI video model to top the Artificial Analysis Video Arena leaderboard — Elo 1333 in Text-to-Video and Elo 1392 in Image-to-Video, both #1 globally under blind human preference voting. Built by Alibaba's Taotian Future Life Lab and led by Zhang Di (former technical architect of Kling AI), it is a 15-billion-parameter unified Transformer that generates joint video and audio from a text or image prompt in a single forward pass.

Most AI video users have had the same experience: the more specific the prompt gets, the more the model quietly sands off the details. HappyHorse 1.0 behaves differently — dense prompts often work better here than simplified ones, but only when the structure is doing real work. This guide is a technique-by-technique walkthrough of the patterns that show up repeatedly in production use, condensed into twelve copy-paste examples grounded in publicly available API documentation and community testing.

The Anatomy of a HappyHorse 1.0 Prompt

HappyHorse 1.0 is more tolerant of detailed prompts than most mainstream video models. That tolerance is not a license to write longer — it is a license to keep more concrete detail. The prompt should tell the model what exists in the scene, what moves, how the camera behaves, how light behaves, what the audio bed should sound like, and what the scene should feel like in time.

The structural order that holds together best across use cases is: scene and timing first, subject second, action and motion third, camera language fourth, lighting and texture fifth, audio intent sixth. Early tokens set the rendering "mode" — talking head, product spot, ambient B-roll, multi-shot scene — and later tokens flesh out the detail. Free-form sentences that interleave these layers tend to drift; ordered prompts hold their shape across reruns.

  1. Scene and timing — where and when ("a sunlit Tokyo cafe in late morning, eight seconds").
  2. Subject — who or what is in the frame, including scale, pose, gaze, and action.
  3. Action and motion — what moves and how much ("slowly turns her head toward the window, no more than 5% camera push-in").
  4. Camera language — movement, angle, shot size ("low-angle slow orbit, 180-degree arc, eye height ~30 cm above the ground").
  5. Lighting and texture — direction, quality, time of day ("soft window light from camera-left, shallow depth of field, 50mm lens").
  6. Audio intent — voice-over (in quotes), dialogue (in quotes), foley (named sounds), ambient bed.
The six-block prompt anatomy applied end-to-end on HappyHorse 1.0

For complex prompts, use short labeled segments or line breaks instead of one long paragraph. A skimmable template is easier to maintain in production code and easier to debug when a single section needs tightening.
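
As a rough illustration of the labeled-segment approach, the sketch below assembles the six blocks in the recommended order. The `PromptBlocks` class, its field names, and the segment labels are hypothetical conveniences rather than part of any HappyHorse API; the result is just a string to paste into the prompt field.

```python
from dataclasses import dataclass


@dataclass
class PromptBlocks:
    """The six prompt blocks, in the order recommended above (hypothetical helper)."""
    scene: str
    subject: str
    action: str
    camera: str
    lighting: str
    audio: str

    def render(self) -> str:
        # The labels are an illustrative convention, not required syntax.
        segments = [
            ("Scene", self.scene),
            ("Subject", self.subject),
            ("Action", self.action),
            ("Camera", self.camera),
            ("Lighting", self.lighting),
            ("Audio", self.audio),
        ]
        # One labeled line per block; empty blocks are skipped.
        return "\n".join(
            f"{label}: {text.strip()}" for label, text in segments if text.strip()
        )


prompt = PromptBlocks(
    scene="a sunlit Tokyo cafe in late morning, eight seconds",
    subject="a woman in her 30s seated by the window, full upper body visible",
    action="slowly turns her head toward the window",
    camera="eye-level medium close-up, slow push-in, no more than 5%",
    lighting="soft window light from camera-left, shallow depth of field, 50mm lens",
    audio="faint cafe ambience, a soft ceramic tap, no music, no dialogue",
).render()
print(prompt)
```

Keeping each block on its own line means a drifting layer (say, the camera move) can be tightened without touching the rest of the prompt.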

Example 1 — Invoking the Photorealistic Mode

HappyHorse 1.0 has an explicit photorealistic rendering mode, and the most reliable way to engage it is to use the word "photorealistic" — or adjacent phrases like "shot like a 35mm film photograph", "documentary style", "iPhone-style handheld" — in the prompt. Detailed camera specs (specific lens, sensor, ISO) are interpreted loosely; treat them as look-and-feel cues rather than physical simulation. The bigger lever is naming imperfections: pores, fine lines, fabric wear, available light, slight motion blur. Anti-cues like "no glamorization" and "no heavy retouching" push the model away from the generic AI portrait look.

Example 1 — invoking photorealistic mode with explicit imperfections, Pro mode, 1080p, 16:9, eight seconds

Prompt

A photorealistic candid clip of an elderly fisherman standing on a small wooden fishing boat. Weathered skin, visible wrinkles, sun-darkened arms, faded traditional sailor tattoos. He is calmly adjusting a net while a small dog sits on the deck beside him. Shot like a 35mm film photograph, eye-level medium close-up, 50mm lens. Soft coastal daylight, shallow depth of field, subtle film grain, natural color balance. Audio: gentle waves against the hull, distant gulls, the soft creak of wood, no music, no dialogue. No glamorization, no heavy retouching. 1080p, 16:9, eight seconds.

Example 2 — Camera Language

HappyHorse 1.0 follows explicit camera-language tokens. The reliable cookbook of moves: push-in, pull-back, dolly, tracking shot, orbit, handheld follow, locked-off, slow pan, tilt up, tilt down, crane up, crane down. Combine a movement with an angle (low-angle, eye-level, overhead, Dutch) and a shot size (wide, medium, close-up, extreme close-up) to nail the cinematographic register. For trajectories, specify both endpoints — "from frame-right to frame-left" — instead of leaving the start and end positions for the model to infer.

| Move | Meaning | Prompt language |
| --- | --- | --- |
| Push in | Camera moves toward the subject | slow push-in, dolly in, no more than X% |
| Pull back | Camera moves away from the subject | pull back, dolly out, reveal |
| Track | Camera moves laterally beside the subject | tracking shot, side tracking, parallel |
| Orbit | Camera circles around the subject | slow orbit, 180-degree arc, low-angle orbit |
| Follow | Camera follows a moving subject | handheld follow, steadicam follow |
| Static | Locked camera with no position change | locked-off camera, static shot |
| Pan / tilt | Camera rotates in place | slow pan, tilt up, tilt down |

HappyHorse 1.0 — camera moves that work reliably
Example 2 — explicit camera language with movement, angle, and shot size, Pro mode, 1080p, 16:9

Prompt

A photorealistic editorial sports clip. Scene: a coastal road at golden hour, ocean horizon visible on the right, eight seconds. Subject: a long-distance runner in a charcoal grey training kit, mid-stride, captured running directly toward the camera. Framing: medium-wide, full body visible, feet included. Camera: low-angle tracking shot, eye height roughly 30 cm above the asphalt, parallel to the runner. Subject placed left of center with negative space on the right two-thirds. Lighting: warm golden hour key from camera-right, soft fill from the ocean reflection on the left, long subject shadow falling toward the lower-left. Audio: rhythmic foot strikes, soft wind, distant ocean, no music. 1080p, 16:9, eight seconds.
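
The movement-plus-angle-plus-shot-size combination described in this section can be sketched as a small clause builder. The vocabularies are copied from the lists above; the `camera_clause` function and its output wording are hypothetical, and the check on trajectory endpoints simply enforces the "specify both endpoints" advice.

```python
MOVES = {
    "push-in", "pull-back", "dolly", "tracking shot", "orbit", "handheld follow",
    "locked-off", "slow pan", "tilt up", "tilt down", "crane up", "crane down",
}
ANGLES = {"low-angle", "eye-level", "overhead", "Dutch"}
SHOT_SIZES = {"wide", "medium", "close-up", "extreme close-up"}


def camera_clause(move, angle, shot_size, start=None, end=None) -> str:
    """Compose one camera sentence from a move, an angle, and a shot size."""
    checks = ((move, MOVES, "move"), (angle, ANGLES, "angle"),
              (shot_size, SHOT_SIZES, "shot size"))
    for value, allowed, kind in checks:
        if value not in allowed:
            raise ValueError(f"unknown {kind}: {value!r}")
    # A trajectory needs both endpoints, otherwise the model has to guess one.
    if (start is None) != (end is None):
        raise ValueError("give both trajectory endpoints or neither")
    clause = f"Camera: {angle} {move}, {shot_size} framing"
    if start is not None:
        clause += f", from {start} to {end}"
    return clause + "."


print(camera_clause("tracking shot", "low-angle", "medium",
                    start="frame-right", end="frame-left"))
```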

Example 3 — Dialogue and Lip-Sync (Quoting Rules)

Dialogue is the capability that most clearly separates HappyHorse 1.0 from earlier video models. The lip-sync pathway is engaged by placing the literal line in quotes: the model treats quoted text as the dialogue track and aligns mouth shapes to the phonemes inside the same denoising step. For best results, name the language explicitly, place the line in quotes, and add a constraint clause for the audio bed ("no music, ambient room tone only").

The single most useful add-on phrase is "EXACT, verbatim, no extra characters" applied to the literal line. Without it, the model occasionally paraphrases or appends a short interjection; with it, the rendered audio matches the prompt exactly. Use Pro mode for clean dialogue.

Example 3 — dialogue with quoted line and EXACT verbatim phrasing, Pro mode, 1080p, 16:9, five seconds

Prompt

A medium close-up of a man in his early 30s, short dark hair, light grey crewneck sweater, sitting in a sunlit home office. He looks directly at the camera and says calmly in clear English (EXACT, verbatim, no extra characters): "I think the simplest version of the idea is also the strongest." Soft diffused window light from camera-left, slightly out-of-focus bookshelf in the background. Eye-level shot, 50mm lens look, shallow depth of field. Audio: subtle natural ambient room tone, no music, no other voices. 1080p, 16:9, five seconds.

For tricky brand names or uncommon spellings inside dialogue, write them out the way they should be heard. The lip-sync pathway is trained on phonemes, so spelling that better matches pronunciation produces tighter sync than the formal written form.
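
The quoting rules in this section are mechanical enough to wrap in a helper. The sketch below is one possible way to do that; `dialogue_clause` is a hypothetical function, and its only job is to name the language, add the verbatim phrase, and wrap the line in straight double quotes.

```python
def dialogue_clause(line: str, language: str, manner: str = "clearly") -> str:
    """Format a spoken line using the quoting rules described above."""
    line = line.strip()
    if not line:
        raise ValueError("the spoken line must not be empty")
    # A stray double quote inside the line would make the quoted span ambiguous.
    if '"' in line:
        raise ValueError("the spoken line must not contain double quotes")
    return (
        f"says {manner} in {language} "
        f'(EXACT, verbatim, no extra characters): "{line}"'
    )


clause = dialogue_clause(
    "I think the simplest version of the idea is also the strongest.",
    "English",
    manner="calmly",
)
print(f"He looks directly at the camera and {clause} "
      "Audio: no music, ambient room tone only.")
```

For brand names, pass the pronunciation-friendly spelling as `line`, since the helper does not change the text it is given.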

Example 4 — Multilingual Prompts and Lip-Sync

HappyHorse 1.0 ships native lip-sync support in seven languages: English, Mandarin, Cantonese, Japanese, Korean, German, and French. The prompting rules are identical across languages: place the literal line in quotes, name the language explicitly, and let the model align mouth shapes to the phonemes of that language. For non-Latin scripts (Japanese, Mandarin, Cantonese, Korean), write the line natively rather than romanizing it — the lip-sync pathway is trained on actual phonemes, not on transliterated approximations.

For mixed-language scenes (one character speaking English, another speaking Mandarin), list each line separately with its speaker and language. The model handles the language pairing automatically as long as the elements are unambiguous.

Example 4 — Korean-language talking head with native lip-sync, Pro mode, 1080p, 16:9, five seconds

Prompt

A medium close-up of a young Korean woman in her late 20s, sitting in a quiet Seoul bookshop with shelves of Korean books behind her, slightly out of focus. She looks toward the camera and says clearly in Korean (EXACT, verbatim): "안녕하세요, 오늘 만나서 정말 반갑습니다." Soft diffused warm light from a window camera-right, eye-level shot, 50mm lens, shallow depth of field. Audio: a faint distant page-turn, soft ambient bookshop tone, no music. 1080p, 16:9, five seconds.

Example 5 — People: Scale, Pose, Gaze, and Action Geometry

For people in scenes, describe scale, body framing, gaze, and object interactions. Generic phrases like "a person doing X" tend to drift on body proportion and limb articulation. Concrete phrases — "full body visible, feet included," "child-sized relative to the table," "looking down at the open book, not at the camera," "hands naturally gripping the handlebars" — pin down the geometry.

This is the difference between "two friends laughing" rendering as a stiff promo shot versus rendering as a believable in-the-moment clip. Add an explicit gaze direction whenever you do not want the subject looking at the camera; the default tendency is camera-aware framing.

Example 5 — explicit scale, pose, gaze, and object interaction for a believable scene, Pro mode, 1080p, 16:9, five seconds

Prompt

A photorealistic candid clip. Scene: a sunlit kitchen, late morning, soft window light from camera-left, five seconds. Subject: a six-year-old child sitting at a wooden kitchen table, reading an oversized hardcover picture book. Scale and framing: child-sized relative to the table, the book takes up about half the visible tabletop, full upper body visible. Pose and gaze: leaning slightly forward on the elbows, looking down at the open book — not at the camera. Action: right hand turns a page slowly, left hand resting flat on the corner of the book. Background: a slightly out-of-focus kitchen counter with a fruit bowl. Lens: 50mm, shallow depth of field. Audio: soft page-turn, faint kitchen ambience, no music. No glamorization, no heavy retouching. 1080p, 16:9, five seconds.

Example 6 — Naming Audio Intent Explicitly

HappyHorse 1.0 generates audio by default — if you do not name the audio bed, the model picks one. The most underused part of HappyHorse 1.0 prompting is the audio intent line. State the audio explicitly: voice-over (in quotes), dialogue (in quotes), foley (named sounds — wave bed, traffic, footsteps, fabric rustle, ceramic tap), ambient bed, and music (or "no music"). When you want a specific sound to land at a specific frame, attach it to the visual event ("a soft ceramic tap as the dropper cap is lifted").

Example 6 — explicit audio intent with named foley and ambient bed, Std mode, 1080p, 16:9, eight seconds

Prompt

A wide shot of a rocky North Atlantic beach in late afternoon. Strong wind, white-capped waves crashing against dark stones, a single figure in a long grey raincoat walking from frame-right to frame-left. Subject occupies the lower third, sky takes the upper two-thirds. Audio (named explicitly): realistic ocean wave bed; gusty wind that intensifies during stronger waves and softens between them; the faint cry of a distant gull at the four-second mark; the soft sound of footsteps on wet stone synchronized to the figure's walk. No music, no dialogue. 1080p, 16:9, eight seconds.
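
One way to make sure the audio line is never skipped is to build it from named parts. The sketch below assumes a hypothetical `audio_intent` helper; it lists foley sounds, ties event-aligned sounds to their visual event with "as", and falls back to "no music" when no music is requested.

```python
def audio_intent(foley, ambient=None, music=None, event_cues=None) -> str:
    """Build an explicit audio line.

    foley: named sounds with no specific timing.
    event_cues: mapping of sound -> visual event the sound should land on.
    """
    parts = list(foley)
    # Attach timed sounds to the visual event they belong to.
    for sound, event in (event_cues or {}).items():
        parts.append(f"{sound} as {event}")
    if ambient:
        parts.append(ambient)
    # Always state the music decision, even when the decision is "none".
    parts.append(music if music else "no music")
    return "Audio: " + "; ".join(parts) + "."


print(audio_intent(
    foley=["realistic ocean wave bed", "gusty wind"],
    ambient="distant gulls",
    event_cues={"soft footsteps on wet stone": "the figure walks"},
))
```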

Example 7 — Element Compositing with @element Tokens

In the fal API, the `happyhorse_elements` field lets you define named assets — each with a name, a description, and 2–4 input images — and reference them in the prompt as `@element_name`. Maximum three elements per task. This is HappyHorse 1.0's explicit composition layer: it lets you pin a specific dog, a specific person, or a specific product into the scene with no naming ambiguity.

The pattern that works: define the elements first (block at the top of the prompt), then write the scene prompt referencing each element by its `@name`. The model treats each element as a distinct asset rather than blending them into one composite reference.
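
A request body for this pattern might be checked before sending, along the lines of the sketch below. Only the `happyhorse_elements` field name comes from the text above; the per-element keys (`name`, `description`, `image_urls`), the top-level `prompt` key, the example URLs, and the assumption that the element name is the full token after `@` are all hypothetical. The limits (three elements, 2 to 4 images each) are the ones stated in this section.

```python
import re

MAX_ELEMENTS = 3
MIN_IMAGES, MAX_IMAGES = 2, 4


def build_elements_payload(prompt: str, elements: list) -> dict:
    """Validate elements against the stated limits and the @tokens in the prompt."""
    if len(elements) > MAX_ELEMENTS:
        raise ValueError(f"at most {MAX_ELEMENTS} elements per task")
    names = set()
    for element in elements:
        count = len(element["image_urls"])
        if not MIN_IMAGES <= count <= MAX_IMAGES:
            raise ValueError(f"{element['name']}: needs 2 to 4 images, got {count}")
        names.add(element["name"])
    # Every @token in the prompt should point at a defined element.
    referenced = set(re.findall(r"@(\w+)", prompt))
    missing = referenced - names
    if missing:
        raise ValueError(f"undefined elements in prompt: {sorted(missing)}")
    # Key names other than happyhorse_elements are assumptions.
    return {"prompt": prompt, "happyhorse_elements": elements}


payload = build_elements_payload(
    "@element_dog runs toward the camera and catches @element_ball mid-air.",
    [
        {"name": "element_dog", "description": "a specific golden retriever",
         "image_urls": ["https://example.com/dog1.jpg", "https://example.com/dog2.jpg"]},
        {"name": "element_ball", "description": "a small bright orange tennis ball",
         "image_urls": ["https://example.com/ball1.jpg", "https://example.com/ball2.jpg"]},
    ],
)
print(sorted(e["name"] for e in payload["happyhorse_elements"]))
```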

Example 7 — element compositing with @element tokens, Pro mode, 1080p, 16:9, six seconds

Prompt

Elements: @element_dog: a specific golden retriever (4 reference images of the same dog from different angles). @element_ball: a small bright orange tennis ball (2 reference images). Scene prompt: A sunlit park lawn at golden hour, soft warm side light, slight wind in the grass. @element_dog runs from frame-right toward the camera and catches @element_ball mid-air, slows down, and trots back the way he came. Handheld follow, eye-level, slight motion blur on the legs. Audio: light wind, a soft "thunk" as the ball is caught, distant park ambience, no music. 1080p, 16:9, six seconds.

Example 8 — Reference-to-Video Compositing

When you pass multiple input images to the reference-to-video endpoint, the recommendation is to reference each input by index and description ("Image 1: product photo… Image 2: style reference… Image 3: target scene…") and describe how they interact ("place the bag from Image 1 into the scene of Image 3, match the lighting of Image 2"). Be explicit about which elements move where.

This indexing convention is what unlocks the "insert this product/person into that scene" workflow without re-generating the whole frame. It also lets you combine three or four references in one call — for example, a person from image 1 wearing garments from images 2, 3, and 4 — with the model treating each input as a distinct asset rather than blending them into one composite reference.
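
The indexing convention can be generated from the same list that holds the input images, which keeps the numbers in the prompt aligned with the upload order. The `indexed_reference_prompt` helper below is a hypothetical sketch; it only formats text and checks that the instruction does not mention an image index that was never supplied.

```python
import re


def indexed_reference_prompt(descriptions: list, instruction: str) -> str:
    """Prefix the instruction with 'Image N: ...' lines matching the input order."""
    if not descriptions:
        raise ValueError("at least one reference description is required")
    header = " ".join(
        f"Image {i}: {text.strip().rstrip('.')}."
        for i, text in enumerate(descriptions, start=1)
    )
    # Catch references such as 'Image 3' when only two images were passed.
    used = {int(n) for n in re.findall(r"Image (\d+)", instruction)}
    out_of_range = sorted(n for n in used if not 1 <= n <= len(descriptions))
    if out_of_range:
        raise ValueError(f"instruction mentions missing images: {out_of_range}")
    return f"{header} {instruction}"


print(indexed_reference_prompt(
    ["a dark-roast coffee bag with a kraft-paper finish",
     "a wooden kitchen counter at sunrise"],
    "Place the bag from Image 1 into the scene from Image 2.",
))
```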

Example 8 — reference-to-video with indexed references, Pro mode, 1080p, 16:9, six seconds

Prompt

Image 1: a specific dark-roast coffee bag with a kraft-paper finish and a visible brand mark. Image 2: a wooden kitchen counter at sunrise with golden window light. Place the bag from Image 1 into the scene from Image 2, standing upright on the counter near the window. Match the lighting direction, color temperature, and depth of field of Image 2 so the bag looks naturally captured in the original photo. Slow push-in from a medium shot to a tight close-up on the brand mark. Do not change the bag's shape, color, or brand mark. Audio: soft morning ambience, a single quiet milk-pour sound, no music. 1080p, 16:9, six seconds.

Example 9 — Surgical Edits: "Change Only X, Keep Everything Else"

HappyHorse 1.0 supports targeted edits through the video-edit endpoint without explicit masks, but the prompt has to be tight to avoid drift. The phrase-of-record is: "change only X" + "keep everything else the same" + repeat the preserve list. For genuinely surgical edits — time-of-day swaps, garment color changes, object removals — also explicitly say not to alter motion, framing, camera move, or surrounding objects, and state whether the audio bed should be preserved or regenerated.

The restated preserve list is what differentiates a clean first-pass edit from a near-miss that requires three retries. Repetition is intentional: the model treats both the change instruction and the preserve list as constraints, and listing the preserve elements twice raises the weight on each one.
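
The "change only X" pattern, including the deliberate repetition of the preserve list, can be templated. The sketch below uses a hypothetical `surgical_edit_prompt` helper; the sentence wording is illustrative, and the `audio` argument just selects between the two options the section mentions (preserve or regenerate).

```python
def surgical_edit_prompt(change: str, preserve: list, audio: str = "preserve") -> str:
    """Build an edit instruction that states the preserve list twice."""
    if not preserve:
        raise ValueError("list at least one thing to preserve")
    if audio not in ("preserve", "regenerate"):
        raise ValueError("audio must be 'preserve' or 'regenerate'")
    preserve_list = ", ".join(preserve)
    audio_clause = (
        "Keep the audio bed exactly the same."
        if audio == "preserve"
        else "Regenerate the audio bed to match the edit."
    )
    return (
        f"Change ONLY {change}. "
        f"Keep everything else the same: {preserve_list}. "
        "Do not alter motion, framing, camera move, or surrounding objects. "
        f"{audio_clause} "
        # The repetition is intentional, as described above.
        f"Restate: preserve {preserve_list}."
    )


print(surgical_edit_prompt(
    "the white kitchen chairs to chairs made of warm oak wood",
    ["camera angle", "room lighting", "floor shadows", "table", "plants"],
))
```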

Example 9 — surgical edit pattern with restated preserve list, Pro mode, 1080p, 16:9, six seconds

Prompt

Take this input clip and change ONLY the white kitchen chairs to chairs made of warm oak wood. Preserve the camera angle, camera move, room lighting, floor shadows, ceiling, walls, table, dishes on the table, plants, and every other object exactly as they appear. Do not alter saturation, contrast, motion, framing, or any object that is not a white chair. Keep the audio bed exactly the same — same room tone, same foley, no music. Photorealistic contact shadows where the new wooden chair legs meet the floor. 1080p, 16:9, six seconds.

Example 10 — Iterative Refinement Beats One Mega-Prompt

Long prompts can work well on HappyHorse 1.0, but debugging is easier when you start with a clean base prompt and refine with small, single-change follow-ups. The recommended pattern: ship a clean first prompt at Std mode, then iterate with phrases like "make the lighting warmer," "remove the music," "tighten the framing." Use references like "same scene as before" or "the subject" to leverage context — but re-specify any critical detail if it begins to drift.

This is the inverse of the design-tool reflex, where each pass adds more constraints. With HappyHorse 1.0, each refinement pass should remove the noise from the prior pass and change at most one or two things. Multi-region simultaneous edits (three or more independent changes in a single call) typically need two or three iterations for a clean result.

Example 10 — three-pass iterative refinement: base prompt → warmer light → tighter framing

Prompt

Pass 1 (base prompt): A photorealistic still life of a single ripe tomato on a wooden cutting board, soft daylight from camera-left, locked-off camera, 50mm lens, shallow depth of field. Audio: faint kitchen ambience, no music. 1080p, 16:9, five seconds. Pass 2 (refinement, edit on output of Pass 1): Make the lighting warmer — shift toward golden-hour color temperature, add a subtle rim light on the right side of the tomato. Keep everything else the same: same tomato, same cutting board, same composition, same framing, same audio bed. Pass 3 (refinement, edit on output of Pass 2): Tighten the framing — crop in by about 20%, the tomato should fill more of the frame, cutting board still partially visible at the bottom. Do not change the lighting, color grade, surface texture, or audio. Restate: keep the same tomato, the same cutting board grain, the same background.

Example 11 — Leveraging Bilingual World Knowledge

HappyHorse 1.0 was trained by a team operating in both Chinese and English, which gives the model unusually strong world knowledge for culturally-specific scenes — Beijing hutong courtyards, Cantonese street markets, Japanese izakaya interiors, Korean bookshops. The implication for prompting: write the scene in whichever language better captures the cultural specificity. For East Asian scenes, native-language prompts often produce more physically accurate detail than translated equivalents.

For any well-documented event, place, or cultural moment, you can prompt with situational cues — a date, a place, a notable scene — and let the model infer the visual context without spelling it out. This works for pre-2026 references; for very recent post-launch events, provide a reference image instead of relying on world knowledge.

Example 11 — Mandarin-language world-knowledge prompting for a culturally-specific scene, Pro mode, 1080p, 16:9, eight seconds

Prompt

一条传统的北京胡同清晨场景。一位老人骑着一辆旧二八自行车,缓缓从画面右侧驶向左侧。两侧是灰砖灰瓦的四合院围墙,几只鸽子从屋顶飞过。柔和的清晨阳光从画面左上方洒下,地面上有淡淡的晨雾。镜头位置低角度,缓慢横移跟随老人。声音:远处的鸽哨声、自行车链条的轻响、几声晨练的吆喝。1080p,16:9,八秒。

English translation of the prompt above, for reference: A traditional Beijing hutong scene in the early morning. An elderly man rides an old 28-inch bicycle slowly from frame-right to frame-left. On both sides are grey-brick, grey-tile courtyard-house walls, and a few pigeons fly over the rooftops. Soft early-morning sunlight falls from the upper left of the frame, with a light morning mist over the ground. The camera sits at a low angle and tracks slowly sideways with the old man. Sound: distant pigeon whistles, the light click of the bicycle chain, a few calls from people doing morning exercises. 1080p, 16:9, eight seconds.

Caveat: world knowledge is bounded by the training cutoff. For post-cutoff brand identities, product designs, or 2026 events that emerged after the model was trained, provide a reference image rather than relying on the model to infer the look. The model will not flag the gap — it will silently invent.

Example 12 — Multi-Shot Sequences in a Single Call

For sequence work (short narrative scenes, brand spots with a beginning-middle-end, storyboard-to-shot conversions), the fal text-to-video endpoint exposes a multi-shot mode. It accepts up to five individually prompted shots of up to twelve seconds each and generates them in one call, with character identity and visual style preserved across shots. The pattern that works: describe the protagonist once at the top, then list each shot as a numbered beat with its own framing, action, lighting, and audio. State the consistency requirements explicitly (face, hair, wardrobe) so the model preserves them across all shots.
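
A multi-shot request could be assembled and checked roughly as follows. The `multi_shots`, `multi_prompt`, and `duration` field names appear elsewhere in this guide; the per-shot keys, the top-level `prompt` key for the protagonist description, and the `build_multi_shot_payload` helper itself are assumptions made for illustration. The limits are the ones stated above (five shots, twelve seconds each), and the total duration is derived from the per-shot values so the two cannot disagree.

```python
MAX_SHOTS = 5
MAX_SHOT_SECONDS = 12


def build_multi_shot_payload(protagonist: str, shots: list) -> dict:
    """shots is a list of (shot_prompt, seconds) pairs, in playback order."""
    if not 1 <= len(shots) <= MAX_SHOTS:
        raise ValueError(f"expected 1 to {MAX_SHOTS} shots, got {len(shots)}")
    multi_prompt = []
    for index, (text, seconds) in enumerate(shots, start=1):
        if not 0 < seconds <= MAX_SHOT_SECONDS:
            raise ValueError(f"shot {index}: duration must be 1 to {MAX_SHOT_SECONDS} s")
        # Per-shot key names are hypothetical.
        multi_prompt.append({"prompt": f"Shot {index}: {text}", "duration": seconds})
    return {
        "prompt": protagonist,  # protagonist and consistency rules, stated once
        "multi_shots": True,
        "multi_prompt": multi_prompt,
        # Derived from the shots so it always equals their sum.
        "duration": sum(seconds for _, seconds in shots),
    }


payload = build_multi_shot_payload(
    "Same character throughout: a tall man in his mid-30s, dark hair, charcoal "
    "grey wool coat. Keep his face, hair, and coat identical across all shots.",
    [
        ("wide establishing shot, he enters a small bright coffee shop", 3),
        ("medium shot at the counter, he orders", 3),
        ("over-the-shoulder of the barista pouring milk", 3),
        ("tight close-up on the finished cup", 3),
    ],
)
print(payload["duration"], len(payload["multi_prompt"]))
```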

Example 12 — multi-shot sequence with explicit shot-by-shot beats, Pro mode, 1080p, 16:9, twelve seconds total

Prompt

A four-shot coffee shop sequence in multi-shot mode. Same character throughout: a tall man in his mid-30s, dark hair, a charcoal grey wool coat over a cream sweater. Keep his face, hair, coat, and sweater identical across all four shots. Shot 1 (3 s): wide establishing shot, he enters a small bright coffee shop on a rainy morning, water on the windows. Audio: gentle rain outside, door bell. Shot 2 (3 s): medium shot at the counter, he orders, light steam from an espresso machine. Audio: soft espresso machine hiss, faint chatter. Shot 3 (3 s): over-the-shoulder of the barista pouring milk into a small cup, latte art forming. Audio: milk pouring sound, gentle steam. Shot 4 (3 s): tight beauty close-up on the finished cup placed on the counter, his hand entering the frame to pick it up. Audio: ceramic tap, quiet music in the background — soft acoustic guitar. 1080p, 16:9, twelve seconds total.

For timing-critical scenes where exact frame timing matters (a sound that must land at a specific second), attach the audio cue to the visual event in the same shot block ("a soft ceramic tap as the dropper cap is lifted"). The model treats those event-aligned audio descriptions as render targets, not loose suggestions.

Mode and Resolution: Picking the Right Settings

HappyHorse 1.0 exposes two modes — Std (the DMD-2 distilled 8-step student) and Pro (an extended denoising schedule) — and supports clip lengths from 3 to 15 seconds in five aspect ratios (16:9, 9:16, 1:1, 4:3, 3:4) at 720p or 1080p. Audio is on by default and adds about 50% to credit usage. Std is good enough for short-form, ideation, and any clip that goes through downstream review. Pro is the right default for dialogue scenes, brand hero shots, multi-shot sequences, and image-to-video animations where fidelity is the bottleneck.
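
The limits in the paragraph above are easy to check before a request is sent. The sketch below is a minimal validator under stated assumptions: the lowercase mode names, the returned key names, and both helper functions are hypothetical, and the audio surcharge is the "about 50%" figure from the text, so the cost estimate is approximate.

```python
MODES = {"std", "pro"}
RESOLUTIONS = {"720p", "1080p"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
MIN_SECONDS, MAX_SECONDS = 3, 15
AUDIO_SURCHARGE = 0.5  # "about 50%" per the guide; treat as approximate


def check_settings(mode, resolution, aspect_ratio, seconds, audio=True) -> dict:
    """Return a settings dict (hypothetical key names) or raise ValueError."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {sorted(MODES)}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect ratio must be one of {sorted(ASPECT_RATIOS)}")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError(f"duration must be {MIN_SECONDS} to {MAX_SECONDS} seconds")
    return {"mode": mode, "resolution": resolution,
            "aspect_ratio": aspect_ratio, "duration": seconds, "audio": audio}


def relative_credit_cost(base_credits: float, audio: bool) -> float:
    """Scale a base credit figure by the approximate audio surcharge."""
    return base_credits * (1 + AUDIO_SURCHARGE) if audio else base_credits


draft = check_settings("std", "720p", "9:16", 5, audio=False)
print(draft, relative_credit_cost(100, audio=True))
```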

| Workflow | Recommended Resolution | Recommended Mode | Aspect | Notes |
| --- | --- | --- | --- | --- |
| Drafts, ideation, batch generation | 720p | Std | 9:16 or 16:9 | Cheapest; turn audio off for additional savings. |
| Talking head with English lip-sync | 1080p | Pro | 16:9 | Pro recovers cleaner phoneme alignment. |
| Multilingual lip-sync (CJK, DE, FR) | 1080p | Pro | 16:9 | Pro is meaningful here; Std drops occasional phonemes. |
| Cinematic product spot | 1080p | Pro | 16:9 | Joint audio is the headline value. |
| Image-to-video animation | 1080p | Pro | Match input | I2V is the model's #1 arena category. |
| Vertical TikTok / Reels | 1080p | Std | 9:16 | Std is plenty for thumb-zone content. |
| Multi-shot scene (up to 5 shots) | 1080p | Pro | 16:9 | Character consistency benefits from Pro. |
| Element compositing (@elements) | 1080p | Pro | 16:9 | Pro recovers fine detail on named assets. |
| Reference-to-video product placement | 1080p | Pro | 16:9 or 9:16 | Brand-mark accuracy benefits from Pro. |
| Stylized I2V transform | 1080p | Pro | Match input | Painterly motion details need Pro. |
| Natural-language video edit | Match input | Pro | Match input | Edits preserve input fidelity automatically. |
| 15-second hero clip | 1080p | Pro | 16:9 | Upper duration boundary; consider chaining for longer. |

HappyHorse 1.0 — mode, resolution, and aspect ratio recommendations by workflow

Common Pitfalls and How to Avoid Them

  • Generic style boosters ("8K, ultra-detailed, masterpiece, cinematic") are mostly leftover patterns from earlier diffusion models. Spend that prompt budget on motion, audio, and constraints instead.
  • Forgetting to quote dialogue. Without quotes, the model paraphrases the line and lip-sync degrades. With quotes plus "EXACT, verbatim", the line is treated as a fixed dialogue track and aligned to phonemes inside the same denoising step.
  • Asking for too much camera motion on a still input. Image-to-video works best with small, named motion ("slow push-in, no more than 5%"). Large camera moves on a still tend to break the source composition.
  • Skipping the audio intent. HappyHorse 1.0 generates audio by default — if you do not name the bed, the model picks one. State it explicitly: voice-over, dialogue, foley, ambient bed, music or no music.
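  • Mixing prompt languages mid-prompt. Pick English or Mandarin for the descriptive text and stay with it across all six elements; a quoted dialogue line stays in its spoken language, as in Example 4. Mixing languages in the descriptive text occasionally produces interpretation drift on the camera move.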
  • Trying to change three or more independent regions in a single edit. Multi-region edits typically need two or three iterations for a clean result. Break the edit into sequential single-change passes.
  • Crowding multi-shot mode. The fal multi-shot endpoint caps at five shots, twelve seconds each. More than that and the model treats it as one continuous clip.
  • Skipping the preserve list on iterative edits. Each refinement pass should restate the invariants — without that restatement, drift compounds across passes.
  • Treating duration as a free parameter on multi-shot. When `multi_shots` is true, the final duration is derived from the sum of `multi_prompt` durations; if you also send `duration`, keep it equal to the summed value or omit it.
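
Several of the pitfalls above can be caught with a quick text check before a prompt is submitted. The sketch below is a deliberately crude linter: `lint_prompt` is a hypothetical helper, the booster list is the one quoted in the first bullet, and the checks are simple string heuristics (for example, it assumes the audio line starts with the label "Audio:"), so a clean result is not a guarantee of a good prompt.

```python
import re

BOOSTERS = ("8k", "ultra-detailed", "masterpiece", "cinematic")


def lint_prompt(prompt: str) -> list:
    """Return a list of warnings for a few of the pitfalls listed above."""
    warnings = []
    lowered = prompt.lower()
    found = [b for b in BOOSTERS if re.search(rf"\b{re.escape(b)}\b", lowered)]
    if found:
        warnings.append("generic style boosters: " + ", ".join(found))
    # Heuristic: treats a literal 'Audio:' label as the audio intent line.
    if "audio:" not in lowered:
        warnings.append("no explicit audio intent line")
    # Heuristic: a speech verb with no straight double quotes anywhere.
    if re.search(r"\bsays\b", lowered) and '"' not in prompt:
        warnings.append("dialogue verb without a quoted line")
    return warnings


for warning in lint_prompt("A masterpiece 8K shot of a man who says hello"):
    print(warning)
```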

A Reusable Prompt Template

If you take one thing from this guide, take the template. It follows the recommended structural order, with an intended-use line added up front and literal dialogue split out ahead of the audio line, and works as a copy-paste starting point for almost every use case in this article:

Intended use → Scene and timing → Subject (with scale, pose, gaze) → Action and motion → Camera language (movement, angle, shot size) → Lighting and texture (direction, quality) → Literal in-prompt dialogue in quotes (EXACT, verbatim) → Audio intent (voice-over, foley, ambient bed, music or no music) → Resolution, aspect ratio, duration.

Start at Std mode, 720p, 16:9 (or 9:16 for short-form), audio on. Run two generations to calibrate the prompt, then move to Pro mode and 1080p for the final asset. For refinements, edit the existing clip with a natural-language instruction rather than regenerating from scratch — the latter is the single biggest source of drift in production work.