Seedance 2.0 Prompting Guide: 12 Techniques for Quad-Modal Video Generation

Seedance 2.0 launched on February 12, 2026 from ByteDance's Seed team. It is the first production AI video model to accept text, image, audio, and video as simultaneous inputs and emit synchronized video and dual-channel audio in a single forward pass. Each call takes up to 9 reference images, 3 video clips, and 3 audio clips and returns 4 to 15 seconds of multi-shot output, with native phoneme-level lip-sync in 8+ languages. A Seedance 2.0 Fast variant covers low-latency batch work. The model is exposed as `doubao-seedance-2-0-260128` on Volcano Engine Ark (China) and BytePlus Ark (international).

Most AI video users have had the same experience: the more specific the prompt gets, the more the model quietly sands off the details. Seedance 2.0 behaves differently — dense prompts often work better here than simplified ones, but only when the structure is doing real work. This guide is a technique-by-technique walkthrough of the patterns that show up repeatedly in production use, condensed into twelve copy-paste examples grounded in ByteDance's technical documentation and community testing.

The Anatomy of a Seedance 2.0 Prompt

Seedance 2.0 is more tolerant of detailed prompts than most mainstream video models, and uniquely tolerant of prompts that mix text with multimodal references. That tolerance is not a license to write longer — it is a license to keep more concrete detail and to declare each reference's role explicitly.

The structural order that holds together best across use cases is: reference declarations first, scene and timing second, subject third, camera language fourth, lighting and texture fifth, audio intent sixth. Early tokens set the rendering "mode" — quad-modal composition, multi-shot, talking head, product spot — and later tokens flesh out the detail. Free-form sentences that interleave these layers tend to drift; ordered prompts hold their shape across reruns.

  1. Reference declarations — list each image, video, and audio reference with its index and role ("Image 1: protagonist", "Video Clip 1: camera move").
  2. Scene and timing — where and when ("a sunlit Tokyo cafe in late morning, ten seconds").
  3. Subject — who or what is in the frame, including scale, pose, gaze, and action.
  4. Camera language — movement, angle, shot size ("low-angle slow orbit, 180-degree arc, eye height ~30 cm above the ground").
  5. Lighting and texture — direction, quality, time of day ("soft window light from camera-left, shallow depth of field, 50mm lens").
  6. Audio intent — voice-over (in quotes), dialogue (in quotes), foley (named sounds), ambient bed.

The six-block prompt anatomy applied end-to-end on Seedance 2.0

For complex prompts, use short labeled segments or line breaks instead of one long paragraph. A skimmable template is easier to maintain in production code and easier to debug when a single section needs tightening.
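
One way to keep the six-block order stable in production code is a small assembler that accepts the blocks in any order and always emits them in the recommended sequence. This is a minimal sketch: `BLOCK_ORDER`, `build_prompt`, and the label strings are hypothetical names chosen for illustration, not part of any Seedance 2.0 SDK or an official prompt syntax.

```python
# Block keys and labels in the order the guide recommends.
# The labels are an illustrative choice, not an official syntax.
BLOCK_ORDER = [
    ("references", "References"),
    ("scene", "Scene"),
    ("subject", "Subject"),
    ("camera", "Camera"),
    ("lighting", "Lighting"),
    ("audio", "Audio"),
]


def build_prompt(blocks: dict[str, str], output_spec: str = "") -> str:
    """Join the supplied blocks in the recommended order, one labeled line each."""
    unknown = set(blocks) - {key for key, _ in BLOCK_ORDER}
    if unknown:
        raise ValueError(f"unknown blocks: {sorted(unknown)}")
    lines = [
        f"{label}: {blocks[key].strip()}"
        for key, label in BLOCK_ORDER
        if blocks.get(key, "").strip()  # skip blocks that were not supplied
    ]
    if output_spec:
        lines.append(output_spec)
    return "\n".join(lines)


# Blocks are passed out of order on purpose; the output is still ordered.
prompt = build_prompt(
    {
        "audio": "faint cafe ambience, no music",
        "scene": "a sunlit Tokyo cafe in late morning, ten seconds",
        "camera": "locked-off camera, eye-level medium shot",
    },
    output_spec="720p, 16:9, ten seconds.",
)
print(prompt)
```

Because each block sits on its own labeled line, tightening one section (say, the camera language) becomes a one-line diff instead of a rewrite of a paragraph.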

Example 1 — Invoking the Photorealistic Mode

Seedance 2.0 has an explicit photorealistic rendering register, and the most reliable way to engage it is to use the word "photorealistic" — or adjacent phrases like "shot like a 35mm film photograph", "documentary style", "iPhone-style handheld" — in the prompt. Detailed camera specs (specific lens, sensor, ISO) are interpreted loosely; treat them as look-and-feel cues rather than physical simulation. The bigger lever is naming imperfections: pores, fine lines, fabric wear, available light, slight motion blur. Anti-cues like "no glamorization" and "no heavy retouching" push the model away from the generic AI portrait look.

Example 1 — invoking photorealistic mode with explicit imperfections, 720p, 16:9, ten seconds

Prompt

A photorealistic candid clip of an elderly fisherman standing on a small wooden fishing boat. Weathered skin, visible wrinkles, sun-darkened arms, faded traditional sailor tattoos. He is calmly adjusting a net while a small dog sits on the deck beside him. Shot like a 35mm film photograph, eye-level medium close-up, 50mm lens. Soft coastal daylight, shallow depth of field, subtle film grain, natural color balance. Audio (dual-channel stereo): gentle waves against the hull, distant gulls, the soft creak of wood, no music, no dialogue. No glamorization, no heavy retouching. 720p, 16:9, ten seconds.

Example 2 — Camera Language and Prompt-Driven Camera Planning

Seedance 2.0 follows explicit camera-language tokens — the model card calls this "prompt-driven camera planning." The reliable cookbook of moves: push-in, pull-back, dolly, tracking shot, orbit, handheld follow, locked-off, slow pan, tilt up, tilt down, crane up, crane down. Combine a movement with an angle (low-angle, eye-level, overhead, Dutch) and a shot size (wide, medium, close-up, extreme close-up) to nail the cinematographic register. For trajectories, specify both endpoints — "from frame-right to frame-left" — instead of leaving the start and end positions for the model to infer.

Move | Meaning | Prompt language
Push in | Camera moves toward the subject | slow push-in, dolly in, no more than X%
Pull back | Camera moves away from the subject | pull back, dolly out, reveal
Track | Camera moves laterally beside the subject | tracking shot, side tracking, parallel
Orbit | Camera circles around the subject | slow orbit, 180-degree arc, low-angle orbit
Follow | Camera follows a moving subject | handheld follow, steadicam follow
Static | Locked camera with no position change | locked-off camera, static shot
Pan / tilt | Camera rotates in place | slow pan, tilt up, tilt down

Table: Seedance 2.0 camera moves that work reliably

Example 2 — explicit camera language with movement, angle, and shot size, 720p, 16:9

Prompt

A photorealistic editorial sports clip. Scene: a coastal road at golden hour, ocean horizon visible on the right, ten seconds. Subject: a long-distance runner in a charcoal grey training kit, mid-stride, running in profile from frame-left to frame-right. Framing: medium-wide, full body visible, feet included. Camera: low-angle tracking shot, eye height roughly 30 cm above the asphalt, parallel to the runner. Subject placed left of center with negative space on the right two-thirds. Lighting: warm golden hour key from camera-right, soft fill from the ocean reflection on the left, long subject shadow falling toward the lower-left. Audio: rhythmic foot strikes, soft wind, distant ocean, no music. 720p, 16:9, ten seconds.
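
The movement + angle + shot size combination described in this section can be generated from a fixed vocabulary so that every camera line uses tokens from the lists above. The sketch below is an assumption about how one might wrap that in code; `camera_line` and the three vocabulary sets are hypothetical helpers, and the check that both trajectory endpoints are present simply encodes the guide's advice.

```python
# Vocabulary taken from the lists in this section.
MOVES = {
    "push-in", "pull-back", "dolly", "tracking shot", "orbit",
    "handheld follow", "locked-off", "slow pan", "tilt up", "tilt down",
    "crane up", "crane down",
}
ANGLES = {"low-angle", "eye-level", "overhead", "Dutch"}
SHOT_SIZES = {"wide", "medium", "close-up", "extreme close-up"}


def camera_line(move, angle, shot_size, start=None, end=None):
    """Compose one camera sentence from a move, an angle, and a shot size."""
    if move not in MOVES:
        raise ValueError(f"unlisted move: {move}")
    if angle not in ANGLES:
        raise ValueError(f"unlisted angle: {angle}")
    if shot_size not in SHOT_SIZES:
        raise ValueError(f"unlisted shot size: {shot_size}")
    # A trajectory needs both endpoints, otherwise the model has to guess one.
    if (start is None) != (end is None):
        raise ValueError("give both trajectory endpoints or neither")
    parts = [f"{angle} {move}", f"{shot_size} shot"]
    if start is not None:
        parts.append(f"from {start} to {end}")
    return "Camera: " + ", ".join(parts) + "."


line = camera_line(
    "tracking shot", "low-angle", "medium",
    start="frame-right", end="frame-left",
)
print(line)
```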

Example 3 — Dialogue and Phoneme-Level Lip-Sync (Quoting Rules)

Dialogue is one of Seedance 2.0's most distinctive capabilities. The lip-sync pathway is engaged by placing the literal line in quotes — the model treats quoted text as the dialogue track and aligns mouth shapes to phonemes inside the same denoising step that produces the rest of the frame. For best results: name the language explicitly, place the line in quotes, and add a constraint clause for the audio bed ("no music, ambient room tone only").

The single most useful add-on phrase is "EXACT, verbatim, no extra characters" applied to the literal line. Without it, the model occasionally paraphrases or appends a short interjection; with it, the rendered audio matches the prompt exactly.

Example 3 — dialogue with quoted line and EXACT verbatim phrasing, 720p, 16:9, six seconds

Prompt

A medium close-up of a man in his early 30s, short dark hair, light grey crewneck sweater, sitting in a sunlit home office. He looks directly at the camera and says calmly in clear English (EXACT, verbatim, no extra characters): "I think the simplest version of the idea is also the strongest." Soft diffused window light from camera-left, slightly out-of-focus bookshelf in the background. Eye-level shot, 50mm lens look, shallow depth of field. Audio: subtle natural ambient room tone, no music, no other voices. 720p, 16:9, six seconds.

For tricky brand names or uncommon spellings inside dialogue, write them out the way they should be heard. The lip-sync pathway is trained on phonemes, so spelling that better matches pronunciation produces tighter sync than the formal written form.
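
The quoting rules above (name the language, quote the line, add the verbatim clause) are easy to apply consistently with a small formatter. This is a sketch under assumptions: `dialogue_clause` is a hypothetical helper, and the exact wording of the clause mirrors the example prompt in this guide rather than any documented syntax.

```python
def dialogue_clause(line: str, language: str, delivery: str = "clearly") -> str:
    """Format one spoken line: named language, verbatim clause, quoted text."""
    text = line.strip()
    if not text:
        raise ValueError("the spoken line is empty")
    # A stray double quote inside the line would end the quoted span early.
    if '"' in text:
        raise ValueError("remove double quotes from the spoken line")
    return (
        f"says {delivery} in {language} "
        f'(EXACT, verbatim, no extra characters): "{text}"'
    )


clause = dialogue_clause(
    "I think the simplest version of the idea is also the strongest.",
    language="English",
    delivery="calmly",
)
print("He looks directly at the camera and " + clause)
```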

Example 4 — Multilingual Prompts and Phoneme-Level Lip-Sync

Seedance 2.0 ships native phoneme-level lip-sync in 8+ languages, including English, Chinese (Mandarin), Japanese, Korean, Spanish, French, German, and Portuguese. The prompting rules are identical across languages: place the literal line in quotes, name the language explicitly, and let the model align mouth shapes to the phonemes of that language. For languages written in non-Latin scripts (Japanese, Mandarin, Korean), write the line in its native script rather than romanizing it.

For mixed-language scenes (one character speaking English, another speaking Mandarin), list each line separately with its speaker and language. The model handles the language pairing automatically as long as each line's speaker and language are unambiguous.
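
For mixed-language scripts, a pre-flight check can catch romanized lines before they reach the model. The sketch below uses Python's standard `unicodedata` module to test whether a line declared as Korean, Japanese, or Mandarin contains at least one character from the expected script. The helper names and the prefix table are hypothetical, and the character-name test is a rough heuristic rather than a full script detector.

```python
import unicodedata

# Unicode character-name prefixes expected for each non-Latin language.
# Heuristic only: Japanese text may legitimately mix kana and kanji.
NATIVE_SCRIPT_PREFIXES = {
    "Korean": ("HANGUL",),
    "Japanese": ("HIRAGANA", "KATAKANA", "CJK UNIFIED"),
    "Mandarin": ("CJK UNIFIED",),
}


def uses_native_script(line: str, language: str) -> bool:
    """Return True if the line contains a character of the expected script."""
    prefixes = NATIVE_SCRIPT_PREFIXES.get(language)
    if prefixes is None:
        return True  # Latin-script languages: nothing to check here
    return any(unicodedata.name(ch, "").startswith(prefixes) for ch in line)


def dialogue_lines(lines):
    """lines: iterable of (speaker, language, text); one output line per speaker."""
    out = []
    for speaker, language, text in lines:
        if not uses_native_script(text, language):
            raise ValueError(f"{speaker}: write the {language} line natively")
        out.append(f'{speaker} says in {language} (EXACT, verbatim): "{text}"')
    return "\n".join(out)


script = dialogue_lines([
    ("The customer", "English", "Good morning."),
    ("The bookseller", "Korean", "안녕하세요, 오늘 만나서 정말 반갑습니다."),
])
print(script)
```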

Example 4 — Korean-language talking head with phoneme-level lip-sync, 720p, 16:9, six seconds

Prompt

A medium close-up of a young Korean woman in her late 20s, sitting in a quiet Seoul bookshop with shelves of Korean books behind her, slightly out of focus. She looks toward the camera and says clearly in Korean (EXACT, verbatim): "안녕하세요, 오늘 만나서 정말 반갑습니다." Soft diffused warm light from a window camera-right, eye-level shot, 50mm lens, shallow depth of field. Audio: a faint distant page-turn, soft ambient bookshop tone, no music. 720p, 16:9, six seconds.

Example 5 — People: Scale, Pose, Gaze, and Action Geometry

For people in scenes, describe scale, body framing, gaze, and object interactions. Generic phrases like "a person doing X" tend to drift on body proportion and limb articulation. Concrete phrases — "full body visible, feet included," "child-sized relative to the table," "looking down at the open book, not at the camera," "hands naturally gripping the handlebars" — pin down the geometry.

This is the difference between "two friends laughing" rendering as a stiff promo shot versus rendering as a believable in-the-moment clip. Add an explicit gaze direction whenever you do not want the subject looking at the camera; the default tendency is camera-aware framing.

Example 5 — explicit scale, pose, gaze, and object interaction, 720p, 16:9, six seconds

Prompt

A photorealistic candid clip. Scene: a sunlit kitchen, late morning, soft window light from camera-left, six seconds. Subject: a six-year-old child sitting at a wooden kitchen table, reading an oversized hardcover picture book. Scale and framing: child-sized relative to the table, the book takes up about half the visible tabletop, full upper body visible. Pose and gaze: leaning slightly forward on the elbows, looking down at the open book — not at the camera. Action: right hand turns a page slowly, left hand resting flat on the corner of the book. Background: a slightly out-of-focus kitchen counter with a fruit bowl. Lens: 50mm, shallow depth of field. Audio: soft page-turn, faint kitchen ambience, no music. No glamorization, no heavy retouching. 720p, 16:9, six seconds.

Example 6 — Naming Audio Intent Explicitly (Dual-Channel Stereo)

Seedance 2.0 generates dual-channel stereo audio by default — if you do not name the audio bed, the model picks one. The most underused part of Seedance 2.0 prompting is the audio intent line. State the audio explicitly: voice-over (in quotes), dialogue (in quotes), foley (named sounds — wave bed, traffic, footsteps, fabric rustle, ceramic tap), ambient bed, and music (or "no music"). When you want a specific sound to land at a specific frame, attach it to the visual event ("a soft ceramic tap as the dropper cap is lifted"). For stereo placement, name the side ("ocean wave bed panned across the frame from camera-right to camera-left").
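
The audio intent line can be built from structured cues so that it is never silently omitted and always ends with an explicit music decision. In this sketch, `audio_intent` is a hypothetical helper; each foley cue is a `(sound, event)` pair, and a cue with an event is rendered as "sound as event" to tie it to the visual moment, as recommended above. The "Audio (dual-channel stereo):" prefix copies the phrasing used in this guide's example prompts.

```python
def audio_intent(foley, ambient="", music=""):
    """Build the audio line from foley cues, an ambient bed, and a music choice.

    foley: list of (sound, event) pairs; event may be None for untimed sounds.
    """
    cues = [f"{sound} as {event}" if event else sound for sound, event in foley]
    if ambient:
        cues.append(ambient)
    # Always state the music decision; the default is an explicit "no music".
    cues.append(music or "no music")
    return "Audio (dual-channel stereo): " + "; ".join(cues) + "."


audio = audio_intent(
    foley=[
        ("a soft ceramic tap", "the dropper cap is lifted"),
        ("distant gulls", None),
    ],
    ambient="ocean wave bed panned from camera-right to camera-left",
)
print(audio)
```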

Example 6 — explicit dual-channel stereo audio intent, 720p, 16:9, ten seconds

Prompt

A wide shot of a rocky North Atlantic beach in late afternoon. Strong wind, white-capped waves crashing against dark stones, a single figure in a long grey raincoat walking from frame-right to frame-left. Subject occupies the lower third, sky takes the upper two-thirds. Audio (dual-channel stereo, named explicitly): ocean wave bed panned across the frame; gusty wind that intensifies during stronger waves and softens between them; the faint cry of a distant gull at the four-second mark; the soft sound of footsteps on wet stone synchronized to the figure's walk. No music, no dialogue. 720p, 16:9, ten seconds.

Example 7 — Indexed Reference Composition

When you pass multiple input images, video clips, or audio clips, the recommendation is to reference each input by index and role ("Image 1: protagonist", "Image 2: location", "Video Clip 1: camera move", "Audio Clip 1: ambient bed") and describe how they interact ("place the protagonist from Image 1 into the location from Image 2; apply the camera move from Video Clip 1; use the ambience from Audio Clip 1"). Be explicit about which elements move where.

This indexing convention is what unlocks the "drop this product/person into that scene" workflow without re-generating the whole frame. It also lets you combine many references in one call — for example, a person from image 1 wearing garments from images 2, 3, and 4 — with the model treating each input as a distinct asset rather than blending them into one composite reference. Reference budget: 9 images + 3 videos + 3 audios per call.
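
The index-and-role convention and the per-call reference budget can both be enforced before a request is sent. The following is a minimal sketch, assuming indices are counted per asset type in the order the assets are attached; `declare_references` and `REFERENCE_CAPS` are hypothetical names, and how the service itself maps indices to uploads is not specified here.

```python
# Per-call reference budget stated in this guide: 9 images, 3 videos, 3 audios.
REFERENCE_CAPS = {"Image": 9, "Video Clip": 3, "Audio Clip": 3}


def declare_references(refs):
    """refs: list of (kind, role) pairs in attachment order.

    Returns the declaration text, e.g. "Image 1: protagonist. Image 2: ...".
    """
    counts = {kind: 0 for kind in REFERENCE_CAPS}
    declarations = []
    for kind, role in refs:
        if kind not in REFERENCE_CAPS:
            raise ValueError(f"unknown reference kind: {kind}")
        counts[kind] += 1
        if counts[kind] > REFERENCE_CAPS[kind]:
            raise ValueError(f"too many {kind} references (max {REFERENCE_CAPS[kind]})")
        declarations.append(f"{kind} {counts[kind]}: {role}.")
    return " ".join(declarations)


header = declare_references([
    ("Image", "protagonist"),
    ("Image", "location"),
    ("Video Clip", "camera move"),
    ("Audio Clip", "ambient bed"),
])
print(header)
```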

Example 7 — indexed reference composition, 720p, 16:9, eight seconds

Prompt

Image 1: a specific dark-roast coffee bag with a kraft-paper finish and a visible brand mark. Image 2: a wooden kitchen counter at sunrise with golden window light. Audio Clip 1: a soft milk-pour and morning kitchen ambience. Place the bag from Image 1 into the scene from Image 2, standing upright on the counter near the window. Match the lighting direction, color temperature, and depth of field of Image 2 so the bag looks naturally captured in the original photo. Use the ambience from Audio Clip 1. Slow push-in from a medium shot to a tight close-up on the brand mark. Do not change the bag's shape, color, or brand mark. 720p, 16:9, eight seconds.

Example 8 — Multi-Shot Sequences in a Single Call

For sequence work — short narrative scenes, brand spots with a beginning-middle-end, storyboard-to-shot conversions — Seedance 2.0 generates multi-shot sequences inside a single 15-second render. The pattern that works: describe the protagonist once at the top, then list each shot as a numbered beat with its own framing, action, lighting, and audio. State the consistency requirements explicitly — face, hair, wardrobe — so the model preserves them across all shots. Total duration must fit inside the 15-second render budget.
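
Since the per-shot durations must sum to a value inside the render budget, it is worth checking the arithmetic before spending a generation. The sketch below numbers the beats and validates the total against the 4 to 15 second range given in this guide; `multi_shot_block` is a hypothetical helper and the "Shot N (x s):" format copies the example prompt that follows.

```python
MIN_SECONDS, MAX_SECONDS = 4, 15  # clip length range stated in this guide


def multi_shot_block(shots):
    """shots: list of (seconds, description). Returns (beat text, total seconds)."""
    total = sum(seconds for seconds, _ in shots)
    if not MIN_SECONDS <= total <= MAX_SECONDS:
        raise ValueError(
            f"shots total {total} s; must be {MIN_SECONDS} to {MAX_SECONDS} s"
        )
    beats = [
        f"Shot {index} ({seconds} s): {description}"
        for index, (seconds, description) in enumerate(shots, start=1)
    ]
    return " ".join(beats), total


block, total = multi_shot_block([
    (4, "wide establishing shot, he enters a small bright coffee shop."),
    (4, "medium shot at the counter, he orders."),
    (3, "over-the-shoulder of the barista pouring milk."),
    (4, "tight close-up on the finished cup on the counter."),
])
print(f"{block} {total} seconds total.")
```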

Example 8 — multi-shot sequence with explicit shot-by-shot beats, 720p, 16:9, fifteen seconds total

Prompt

A four-shot coffee shop sequence in multi-shot mode. Same character throughout: a tall man in his mid-30s, dark hair, a charcoal grey wool coat over a cream sweater. Keep his face, hair, coat, and sweater identical across all four shots. Shot 1 (4 s): wide establishing shot, he enters a small bright coffee shop on a rainy morning, water on the windows. Audio: gentle rain outside, door bell. Shot 2 (4 s): medium shot at the counter, he orders, light steam from an espresso machine. Audio: soft espresso machine hiss, faint chatter. Shot 3 (3 s): over-the-shoulder of the barista pouring milk into a small cup, latte art forming. Audio: milk pouring sound, gentle steam. Shot 4 (4 s): tight beauty close-up on the finished cup placed on the counter, his hand entering the frame to pick it up. Audio: ceramic tap, quiet music in the background — soft acoustic guitar. 720p, 16:9, fifteen seconds total.

For timing-critical scenes, where a sound must land at a specific moment, attach the audio cue to the visual event in the same shot block ("a soft ceramic tap as the dropper cap is lifted"). The model treats those event-aligned audio descriptions as render targets, not loose suggestions.

Example 9 — Surgical Edits: "Change Only X, Keep Everything Else"

Seedance 2.0 supports targeted edits to specified clips, characters, actions, and storylines through the editing endpoint without explicit masks, but the prompt has to be tight to avoid drift. The phrase-of-record is: "change only X" + "keep everything else the same" + repeat the preserve list. For genuinely surgical edits — time-of-day swaps, garment color changes, object removals — also explicitly say not to alter motion, framing, camera move, or surrounding objects, and state whether the audio bed should be preserved or regenerated.

The restated preserve list is what differentiates a clean first-pass edit from a near-miss that requires three retries. Repetition is intentional: the model treats both the change instruction and the preserve list as constraints, and listing the preserve elements twice raises the weight on each one.
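
The "change only X" pattern with a restated preserve list can be templated so the repetition is never forgotten. This is a sketch, not a documented request format: `surgical_edit_prompt` is a hypothetical helper, and the sentence wording is modeled on the example prompt in this section.

```python
def surgical_edit_prompt(change, preserve, keep_audio=True):
    """Build an edit prompt that states the change once and the preserve list twice."""
    if not preserve:
        raise ValueError("list at least one element to preserve")
    preserve_list = ", ".join(preserve)
    audio_clause = (
        "Keep the audio bed exactly the same."
        if keep_audio
        else "Regenerate the audio bed to match the edit."
    )
    return (
        f"Take this input clip and change ONLY {change}. "
        f"Keep everything else the same: {preserve_list}. "
        "Do not alter motion, framing, camera move, or surrounding objects. "
        f"{audio_clause} "
        f"Restate: preserve {preserve_list}."
    )


edit = surgical_edit_prompt(
    "the white kitchen chairs to chairs made of warm oak wood",
    preserve=["camera angle", "room lighting", "floor shadows", "table"],
)
print(edit)
```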

Example 9 — surgical edit pattern with restated preserve list, 720p, 16:9, eight seconds

Prompt

Take this input clip and change ONLY the white kitchen chairs to chairs made of warm oak wood. Preserve the camera angle, camera move, room lighting, floor shadows, ceiling, walls, table, dishes on the table, plants, and every other object exactly as they appear. Do not alter saturation, contrast, motion, framing, or any object that is not a white chair. Keep the audio bed exactly the same — same room tone, same foley, no music. Photorealistic contact shadows where the new wooden chair legs meet the floor. 720p, 16:9, eight seconds.

Example 10 — Video Extension: "Continuing the Shoot"

Seedance 2.0 ships a video extension capability the team describes as "continuing the shoot." It accepts an existing clip plus a continuation prompt and emits new shots that pick up where the input left off — same character, same location, same visual style. The pattern that works: feed the input clip, describe what happens next as one or more shots with explicit beats, and re-state the invariants you need preserved (character identity, location, time of day, audio bed continuity).

Example 10 — video extension picking up from an existing clip, 720p, 16:9, ten seconds

Prompt

Take this input clip — the man in the charcoal coat picking up his coffee cup at the counter — and continue the shoot with two new shots inside the same coffee shop. Same character, same coat, same sweater, same bookshelf and counter behind him. Shot 1 (5 s): medium tracking shot, he walks from the counter to a small two-person table by the window, places the cup down, and sits. Shot 2 (5 s): static eye-level close-up of him taking the first sip, eyes closing briefly, a slow exhale. Audio: continued ambient cafe tone, soft espresso machine, no music. 720p, 16:9, ten seconds total.

Example 11 — Leveraging Bilingual World Knowledge

Seedance 2.0 was trained by ByteDance's Seed team operating in both Chinese and English, which gives the model unusually strong world knowledge for culturally-specific scenes — Beijing hutong courtyards, Cantonese street markets, Sichuan teahouses, Korean bookshops, Japanese izakaya interiors. The implication for prompting: write the scene in whichever language better captures the cultural specificity. For East Asian scenes, native-language prompts often produce more physically accurate detail than translated equivalents.

For any well-documented event, place, or cultural moment, you can prompt with situational cues (a date, a place, a notable scene) and let the model infer the visual context without spelling it out. This works only for material that predates the model's training data; for anything more recent, provide a reference image instead of relying on world knowledge.

Example 11 — Mandarin-language world-knowledge prompting for a culturally-specific scene, 720p, 16:9, ten seconds

Prompt

一条传统的北京胡同清晨场景。一位老人骑着一辆旧二八自行车,缓缓从画面右侧驶向左侧。两侧是灰砖灰瓦的四合院围墙,几只鸽子从屋顶飞过。柔和的清晨阳光从画面左上方洒下,地面上有淡淡的晨雾。镜头位置低角度,缓慢横移跟随老人。声音:远处的鸽哨声、自行车链条的轻响、几声晨练的吆喝。720p,16:9,十秒。

Caveat: world knowledge is bounded by the training cutoff. For post-cutoff brand identities, product designs, or 2026 events that emerged after the model was trained, provide a reference image rather than relying on the model to infer the look. The model will not flag the gap — it will silently invent.

Example 12 — Iterative Refinement Beats One Mega-Prompt

Long prompts can work well on Seedance 2.0, but debugging is easier when you start with a clean base prompt and refine with small, single-change follow-ups. The recommended pattern: run a clean first prompt on the Fast variant, then iterate with phrases like "make the lighting warmer," "remove the music," "tighten the framing." Use references like "same scene as before" or "the subject" to leverage context, but re-specify any critical detail if it begins to drift.

This is the inverse of the design-tool reflex, where each pass adds more constraints. With Seedance 2.0, each refinement pass should remove the noise from the prior pass and change at most one or two things. Multi-region simultaneous edits (three or more independent changes in a single call) typically need two or three iterations for a clean result.
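
A list of desired changes can be split mechanically into sequential single-change passes, each restating the invariants. The sketch below only produces the prompt text for each pass; `plan_passes` is a hypothetical helper, and submitting each pass as an edit on the previous output is left to whatever client code you use.

```python
def plan_passes(changes, invariants):
    """Turn a list of changes into one refinement prompt per change."""
    if not invariants:
        raise ValueError("restate at least one invariant per pass")
    keep = "Keep everything else the same: " + ", ".join(invariants) + "."
    passes = []
    for change in changes:
        text = change.strip().rstrip(".")
        if not text:
            raise ValueError("empty change instruction")
        # Capitalize the instruction and close it with a period.
        passes.append(f"{text[0].upper()}{text[1:]}. {keep}")
    return passes


passes = plan_passes(
    ["make the lighting warmer", "remove the music", "tighten the framing"],
    invariants=["same tomato", "same cutting board", "same audio bed"],
)
for number, text in enumerate(passes, start=2):
    print(f"Pass {number}: {text}")
```

Three independent changes become three passes here, which matches the guide's observation that multi-region edits tend to need two or three iterations anyway.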

Example 12 — three-pass iterative refinement: base prompt → warmer light → tighter framing

Prompt

Pass 1 (base prompt, Seedance 2.0 Fast): A photorealistic still life of a single ripe tomato on a wooden cutting board, soft daylight from camera-left, locked-off camera, 50mm lens, shallow depth of field. Audio: faint kitchen ambience, no music. 720p, 16:9, six seconds. Pass 2 (refinement, edit on output of Pass 1): Make the lighting warmer — shift toward golden-hour color temperature, add a subtle rim light on the right side of the tomato. Keep everything else the same: same tomato, same cutting board, same composition, same framing, same audio bed. Pass 3 (refinement, edit on output of Pass 2): Tighten the framing — crop in by about 20%, the tomato should fill more of the frame, cutting board still partially visible at the bottom. Do not change the lighting, color grade, surface texture, or audio. Restate: keep the same tomato, the same cutting board grain, the same background.

Variant and Resolution: Picking the Right Settings

Seedance 2.0 ships in two variants — the full Seedance 2.0 and Seedance 2.0 Fast — and supports clip lengths from 4 to 15 seconds in six aspect ratios (16:9, 9:16, 1:1, 4:3, 3:4, 21:9) at 480p or 720p natively. Audio is on by default and is dual-channel stereo. Seedance 2.0 Fast is good enough for short-form, ideation, and any clip that goes through downstream review. The full Seedance 2.0 is the right default for dialogue scenes, brand hero shots, multi-shot sequences, and quad-modal reference compositions where fidelity is the bottleneck.

Workflow | Recommended Resolution | Recommended Variant | Aspect | Notes
Drafts, ideation, batch generation | 480p | Seedance 2.0 Fast | 9:16 or 16:9 | Cheapest path; fastest turnaround.
Talking head with English lip-sync | 720p | Seedance 2.0 | 16:9 | Phoneme-level alignment is cleanest on the full model.
Multilingual lip-sync (8+ languages) | 720p | Seedance 2.0 | 16:9 | Full model is meaningfully better for non-English phonemes.
Cinematic product spot | 720p | Seedance 2.0 | 16:9 or 21:9 | Joint dual-channel audio is the headline value.
Quad-modal reference composition | 720p | Seedance 2.0 | Match brief | Reference fidelity benefits from the full model.
Image-to-video animation | 720p | Seedance 2.0 | Match input | First/last-frame conditioning supported.
Vertical TikTok / Reels | 720p | Seedance 2.0 Fast | 9:16 | Fast is plenty for thumb-zone content.
Multi-shot scene (up to 15 s) | 720p | Seedance 2.0 | 16:9 | Character consistency benefits from the full model.
21:9 cinematic widescreen | 720p | Seedance 2.0 | 21:9 | Native 21:9 composition.
Stylized I2V transform | 720p | Seedance 2.0 | Match input | Painterly motion details need the full model.
Targeted video edit | Match input | Seedance 2.0 | Match input | Edits preserve input fidelity automatically.
Video extension ("continuing the shoot") | Match input | Seedance 2.0 | Match input | Re-state invariants for clean continuation.

Table: Seedance 2.0 variant, resolution, and aspect ratio recommendations by workflow
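
The limits in this section (two variants, two native resolutions, six aspect ratios, 4 to 15 seconds) can be captured in a small settings object that rejects invalid combinations early. This is an illustrative sketch: `RenderSettings` is a hypothetical class, not a type from any ByteDance SDK, and it validates only the values listed in this guide.

```python
from dataclasses import dataclass

VARIANTS = {"Seedance 2.0", "Seedance 2.0 Fast"}
RESOLUTIONS = {"480p", "720p"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4", "21:9"}


@dataclass(frozen=True)
class RenderSettings:
    variant: str
    resolution: str
    aspect_ratio: str
    seconds: int

    def __post_init__(self):
        # Validate each field against the values listed in this guide.
        if self.variant not in VARIANTS:
            raise ValueError(f"unknown variant: {self.variant}")
        if self.resolution not in RESOLUTIONS:
            raise ValueError(f"unsupported resolution: {self.resolution}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"unsupported aspect ratio: {self.aspect_ratio}")
        if not 4 <= self.seconds <= 15:
            raise ValueError("clip length must be 4 to 15 seconds")

    def suffix(self) -> str:
        """Trailing prompt fragment, e.g. '480p, 16:9, 6 seconds.'"""
        return f"{self.resolution}, {self.aspect_ratio}, {self.seconds} seconds."


# Draft preset from the first table row; final preset for a dialogue scene.
DRAFT = RenderSettings("Seedance 2.0 Fast", "480p", "16:9", 6)
FINAL = RenderSettings("Seedance 2.0", "720p", "16:9", 6)
print(DRAFT.suffix())
```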

Common Pitfalls and How to Avoid Them

  • Generic style boosters ("8K, ultra-detailed, masterpiece, cinematic") are mostly leftover patterns from earlier diffusion models. Spend that prompt budget on motion, audio, and reference declarations instead.
  • Forgetting to declare reference roles. When passing multiple images, videos, or audio clips, declare each one with an index and a role at the top of the prompt. Without role declarations, references blend rather than stay distinct.
  • Forgetting to quote dialogue. Without quotes, the model paraphrases the line and lip-sync degrades. With quotes plus "EXACT, verbatim", the line is treated as a fixed dialogue track and aligned to phonemes inside the same denoising step.
  • Asking for too much camera motion on a still input. Image-to-video works best with small, named motion ("slow push-in, no more than 5%"). Large camera moves on a still tend to break the source composition.
  • Skipping the audio intent. Seedance 2.0 generates dual-channel stereo audio by default — if you do not name the bed, the model picks one. State it explicitly: voice-over, dialogue, foley, ambient bed, music or no music.
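  • Mixing prompt languages mid-prompt. Pick English or Mandarin for the descriptive text and stay with it across all six blocks; a quoted dialogue line in another language, as in Example 4, is the exception. Mixing languages elsewhere occasionally produces interpretation drift on the camera move.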
  • Trying to change three or more independent regions in a single edit. Multi-region edits typically need two or three iterations for a clean result. Break the edit into sequential single-change passes.
  • Crowding multi-shot mode. Multi-shot fits inside a 15-second render budget; the sum of per-shot durations cannot exceed it. For longer narratives, chain calls or use the video extension capability.
  • Skipping the preserve list on iterative edits. Each refinement pass should restate the invariants — without that restatement, drift compounds across passes.
  • Exceeding the reference asset cap. Seedance 2.0 accepts up to 9 images, 3 videos, and 3 audios per call. Beyond that, the model treats inputs as a soft composite rather than as distinct references.
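
Several of the pitfalls above are mechanical enough to catch with a simple linter before a prompt is submitted. The sketch below checks three of them: generic style boosters, a speaking verb with no quoted line, and a missing audio intent. `lint_prompt` and its rules are hypothetical and deliberately crude (plain string and regex tests on English prompts); treat the warnings as hints, not verdicts.

```python
import re

# Generic boosters named in the pitfalls list above.
STYLE_BOOSTERS = ("8k", "ultra-detailed", "masterpiece", "cinematic")


def lint_prompt(prompt: str) -> list[str]:
    """Return a list of warnings for common prompt pitfalls (English prompts)."""
    warnings = []
    lowered = prompt.lower()

    found = [
        booster for booster in STYLE_BOOSTERS
        if re.search(rf"\b{re.escape(booster)}\b", lowered)
    ]
    if found:
        warnings.append("generic style boosters: " + ", ".join(found))

    # A speaking verb with no quoted span suggests unquoted dialogue.
    has_speech = re.search(r"\bsays?\b", lowered) is not None
    has_quote = re.search(r'["“][^"”]+["”]', prompt) is not None
    if has_speech and not has_quote:
        warnings.append("dialogue is not quoted")

    if "audio" not in lowered:
        warnings.append("no audio intent stated")

    return warnings


print(lint_prompt("A man says hello to the camera, 8K, masterpiece."))
```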

A Reusable Prompt Template

If you take one thing from this guide, take the template. It follows the recommended structural order and works as a copy-paste starting point for almost every use case in this article:

Reference declarations (Image 1: …, Image 2: …, Video Clip 1: …, Audio Clip 1: …) → Intended use → Scene and timing → Subject (with scale, pose, gaze) → Camera language (movement, angle, shot size) → Lighting and texture (direction, quality) → Literal in-prompt dialogue in quotes (EXACT, verbatim) → Audio intent (voice-over, foley, ambient bed, music or no music) → Resolution, aspect ratio, duration.

Start at Seedance 2.0 Fast, 480p, 16:9 (or 9:16 for short-form), audio on. Run two generations to calibrate the prompt, then move to the full Seedance 2.0 and 720p for the final asset. For refinements, edit the existing clip with a natural-language instruction rather than regenerating from scratch — the latter is the single biggest source of drift in production work.