AI Shorts Pipeline
The AI Shorts pipeline generates production-ready short-form video (TikTok, Reels, Shorts) from a single topic string. It orchestrates script generation, AI actor creation, voiceover, b-roll video, background music, talking heads, lip-sync, subtitles, and final composition — all as a single workflow.
Quick Start
```shell
# Minimal — topic only (faceless, 45s, educational)
fabric run video/ai-shorts --input topic="Why cats always land on their feet"

# With a talking-head presenter
fabric run video/ai-shorts \
  --input topic="AI in healthcare" \
  --input actor_image_url="https://example.com/presenter.png" \
  --input voice=narrator

# Full control
fabric run video/ai-shorts \
  --input topic="5 morning habits that changed my life" \
  --input style=listicle \
  --input duration_seconds=30 \
  --input audience=professionals \
  --input quality=premium
```

```python
from fabric_platform import FabricClient

fabric = FabricClient()

# Faceless short
run = fabric.ai_shorts("Why sleep is a superpower")

# Talking-head with custom voice
run = fabric.ai_shorts(
    "AI is replacing junior developers",
    actor_image_url="https://example.com/presenter.png",
    voice="energetic-male",
    style="promotional",
    duration_seconds=60,
)
print(run["output"]["title"])
print(run["output"]["video_asset_id"])
```

```typescript
import { FabricClient } from "@fabric-platform/sdk";

const fabric = new FabricClient();

const run = await fabric.workflows.runs.submit("video/ai-shorts", {
  input: {
    topic: "Why sleep is a superpower",
    actor_image_url: "https://example.com/presenter.png",
    voice: "warm",
    style: "storytelling",
    duration_seconds: 45,
  },
});
const result = await fabric.workflows.runs.waitForRun(run.id);
```

Pipeline Architecture

```
             generate_script
                    │
       generate_keyframes (opt-in)
                    │
multiview_character_prep (premium/ultra)
                    │
          ┌─────────┼──────────┬──────────┐
          │         │          │          │
      generate_ generate_  generate_  generate_
      ai_actor  voiceover  broll (×3)    bgm
          │         │          │          │
          └─────────┼──────────┴──────────┘
                    │
            merge_generation
                    │
         generate_talking_heads
                    │
          lipsync_talking_heads
                    │
                mix_audio
                    │
          transcribe_voiceover
                    │
            compose_timeline
                    │
             burn_subtitles
                    │
            burn_hook_overlay
                    │
            effects_pipeline
                    │
          collect_final_output
```

The 6-way fork (actor, voiceover, b-roll x3, music) runs in parallel for maximum throughput. On a fast connection with remote models, a 45-second video completes in 3-5 minutes.
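The fork-and-merge shape of the parallel stage can be illustrated with plain `asyncio`. This is only a sketch of the concurrency pattern, not the pipeline's actual scheduler: `fake_step`, `run_fork`, the step delays, and the `generate_broll_0..2` names are assumptions made for illustration, and each step just sleeps instead of calling a model.

```python
import asyncio
from typing import Dict


async def fake_step(name: str, seconds: float) -> str:
    """Hypothetical stand-in for a generation step: sleeps, then returns a label."""
    await asyncio.sleep(seconds)
    return f"{name}:done"


async def run_fork() -> Dict[str, str]:
    # Six branches, mirroring the diagram: actor, voiceover, b-roll x3, music.
    steps = {
        "generate_ai_actor": 0.02,
        "generate_voiceover": 0.03,
        "generate_broll_0": 0.05,
        "generate_broll_1": 0.05,
        "generate_broll_2": 0.05,
        "generate_bgm": 0.04,
    }
    # gather() runs the coroutines concurrently and returns results in input order.
    results = await asyncio.gather(
        *(fake_step(name, delay) for name, delay in steps.items())
    )
    # Rough equivalent of merge_generation: one mapping with every branch's output.
    return dict(zip(steps, results))


merged = asyncio.run(run_fork())
```

Because the branches run concurrently, the wall-clock time of the fork is close to that of the slowest branch rather than the sum of all six.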
Input Parameters
Simple inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| `topic` | string | required | The subject of the video |
| `actor_image_url` | string | null | Public URL to a portrait image. When set → talking-head with lip sync. When omitted → faceless b-roll video. |
| `voice` | string | "alloy" | TTS voice style (narrator, warm, energetic-male, energetic-female, deep, friendly) |
| `duration_seconds` | int | 45 | Target duration (15-120 seconds) |
| `style` | string | "educational" | educational, promotional, storytelling, listicle — maps to mood internally |
| `audience` | string | "general" | Target audience: general, teens, professionals, technical, kids |
| `platforms` | string[] | [] | Platform export slugs: tiktok, instagram, youtube_shorts, youtube, linkedin. When set, output includes resized variants with platform-specific metadata. |
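The constraints in the table above can be checked on the client before a run is submitted. The following is a minimal sketch, assuming a hypothetical `ShortsInput` dataclass that is not part of the Fabric SDK; the allowed values and defaults are copied from the table, and the `mode` property reflects the documented rule that an actor image switches the video from faceless to talking-head.

```python
from dataclasses import dataclass, field
from typing import List, Optional

STYLES = {"educational", "promotional", "storytelling", "listicle"}
AUDIENCES = {"general", "teens", "professionals", "technical", "kids"}
PLATFORMS = {"tiktok", "instagram", "youtube_shorts", "youtube", "linkedin"}


@dataclass
class ShortsInput:
    """Hypothetical client-side mirror of the simple inputs (not an SDK class)."""
    topic: str
    actor_image_url: Optional[str] = None
    voice: str = "alloy"
    duration_seconds: int = 45
    style: str = "educational"
    audience: str = "general"
    platforms: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.topic.strip():
            raise ValueError("topic is required")
        if not 15 <= self.duration_seconds <= 120:
            raise ValueError("duration_seconds must be between 15 and 120")
        if self.style not in STYLES:
            raise ValueError(f"unknown style: {self.style}")
        if self.audience not in AUDIENCES:
            raise ValueError(f"unknown audience: {self.audience}")
        unknown = set(self.platforms) - PLATFORMS
        if unknown:
            raise ValueError(f"unknown platforms: {sorted(unknown)}")

    @property
    def mode(self) -> str:
        # An actor image selects talking-head; otherwise the video is faceless.
        return "talking-head" if self.actor_image_url else "faceless"
```

For example, `ShortsInput(topic="AI in healthcare").mode` evaluates to `"faceless"`, while passing `duration_seconds=10` raises `ValueError`.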
Advanced overrides
These bypass the simple inputs above for full control:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `mood` | string | derived from style | Direct mood/tone override (e.g. “high-energy and conversational”) |
| `hook` | string | "" | Specific hook line (auto-generated if empty) |
| `platform` | string | "TikTok" | Target platform (affects pacing and framing) |
| `presenter_look` | string | "" | AI actor appearance description |
| `visual_style` | string | "" | Override visual aesthetic (e.g. “neon cyberpunk”) |
| `quality` | string | "" | Quality preset: budget, standard, premium, ultra, local, local-power |
| `use_keyframe_grid` | bool | false | Enable 2x2 grid keyframe generation |
Quality Presets
Quality presets control which models are used at each pipeline stage:
| Preset | Script | TTS | Avatar | B-Roll | Lip-sync | Music |
|---|---|---|---|---|---|---|
| `cheap` | Gemini Flash | Kokoro | Kling v2 | Veo 3.1 | FAL | Stable Audio |
| `premium` | Gemini Flash | ElevenLabs | Kling v2 | Kling v2.5 | VEED | Stable Audio |
| `ultra` | Gemini Flash | ElevenLabs | OmniHuman | Kling v3 (i2v) | built-in | Stable Audio |
| `local` | Qwen3 8B | Kokoro | Wav2Lip | WAN 1.3B | Wav2Lip | MusicGen |
| `local-power` | Qwen3 latest | Kokoro | Wav2Lip | WAN 1.3B | Wav2Lip | MusicGen |
| `local-light` | Gemma3 4B | Piper | skip | skip | skip | skip |
Individual model keys can override any preset value. See Model Configuration for details.
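The preset-plus-override behaviour can be pictured as a dictionary merge. The sketch below is an illustration only: `PRESETS`, `resolve_models`, and the stage key names (`script`, `tts`, `avatar`, `broll`, `lipsync`, `music`) are assumptions, not the pipeline's real configuration keys. The model names are taken from three rows of the table above.

```python
from typing import Dict

# Three rows of the preset table; the stage key names are hypothetical.
PRESETS: Dict[str, Dict[str, str]] = {
    "premium": {
        "script": "Gemini Flash", "tts": "ElevenLabs", "avatar": "Kling v2",
        "broll": "Kling v2.5", "lipsync": "VEED", "music": "Stable Audio",
    },
    "ultra": {
        "script": "Gemini Flash", "tts": "ElevenLabs", "avatar": "OmniHuman",
        "broll": "Kling v3 (i2v)", "lipsync": "built-in", "music": "Stable Audio",
    },
    "local": {
        "script": "Qwen3 8B", "tts": "Kokoro", "avatar": "Wav2Lip",
        "broll": "WAN 1.3B", "lipsync": "Wav2Lip", "music": "MusicGen",
    },
}


def resolve_models(quality: str, **overrides: str) -> Dict[str, str]:
    """Start from a preset, then let individual keys replace preset values."""
    if quality not in PRESETS:
        raise KeyError(f"unknown quality preset: {quality}")
    models = dict(PRESETS[quality])  # copy so the preset itself is never mutated
    unknown = set(overrides) - set(models)
    if unknown:
        raise ValueError(f"unknown model keys: {sorted(unknown)}")
    models.update(overrides)
    return models


models = resolve_models("local", tts="Piper")
```

Here `models["tts"]` is `"Piper"` while every other stage keeps its `local` value.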
Pipeline Stages
1. Script Generation
An LLM generates a structured script with:
- Hook text — the attention-grabbing opening line
- Full narration — the complete voiceover script (110-160 words)
- Segments — 5-7 alternating segments of type `actor_talking` or `broll`, each with timing, narration text, and a visual prompt
The script step also produces a continuity brief: a text prefix that encodes the video’s unified color palette, film stock, and atmospheric quality. This prefix is prepended to every downstream visual generation prompt. See Shot Design.
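As a rough picture of the script structure and of how the continuity brief reaches the visual prompts, consider the sketch below. The `Segment` and `Script` classes and all their field names are assumptions for illustration; the real script schema is not documented here. The example uses two segments for brevity, whereas the pipeline generates 5-7.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str           # "actor_talking" or "broll"
    start: float        # seconds from the start of the video
    end: float
    narration: str
    visual_prompt: str


@dataclass
class Script:
    hook: str
    narration: str
    continuity_brief: str
    segments: List[Segment]


def visual_prompts(script: Script) -> List[str]:
    """Prepend the continuity brief to each segment's visual prompt."""
    return [f"{script.continuity_brief} {seg.visual_prompt}" for seg in script.segments]


script = Script(
    hook="Cats cheat gravity.",
    narration="Cats cheat gravity. Here is how they do it.",
    continuity_brief="Warm amber palette, 35mm film grain, soft haze.",
    segments=[
        Segment("actor_talking", 0.0, 3.0, "Cats cheat gravity.", "Presenter, medium close-up."),
        Segment("broll", 3.0, 8.0, "Here is how they do it.", "Slow-motion cat twisting mid-air."),
    ],
)
prompts = visual_prompts(script)
```

Every entry in `prompts` starts with the same brief, which is what keeps the look consistent across independently generated clips.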
2. Keyframe Grid (Optional)
When `use_keyframe_grid` is enabled, a 2x2 grid of keyframe images is generated from the b-roll segment descriptions. All 4 panels are generated in a single image, which forces visual consistency. The grid is then cropped into individual keyframes that serve as reference images for image-to-video generation.
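The cropping step amounts to splitting one image into four equal rectangles. The helper below is a hypothetical sketch, not the pipeline's code: it only computes the crop boxes, as `(left, upper, right, lower)` tuples, which is the box format Pillow's `Image.crop` accepts. The 2048x2048 grid size is an assumed example value.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels


def grid_crop_boxes(width: int, height: int, rows: int = 2, cols: int = 2) -> List[Box]:
    """Return crop boxes for a rows x cols grid, in row-major order."""
    cell_w, cell_h = width // cols, height // rows
    return [
        (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
        for r in range(rows)
        for c in range(cols)
    ]


boxes = grid_crop_boxes(2048, 2048)
# boxes[0] is the top-left panel, boxes[3] the bottom-right panel.
```

Each box could then be passed to an image library's crop call to obtain the individual keyframe.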
3. Multi-View Character Prep (Premium/Ultra)
At premium and ultra quality tiers, when an actor image is available, the pipeline generates 4 perspective views (front, 3/4, profile, back) of the character. These multi-angle references are passed to all downstream Seedance b-roll calls, dramatically improving character consistency across different camera angles and scenes.
This step is automatic — no configuration needed beyond selecting a premium/ultra preset and providing an actor image. See VFX Compositing for details and direct SDK usage.
4. Parallel Generation
Six tasks run concurrently:
- AI Actor — Generates a portrait image matching the `presenter_look` description via Imagen 4
- Voiceover — Text-to-speech of the full narration (ElevenLabs, Kokoro, or Piper)
- B-Roll (x3) — Up to 3 b-roll video clips generated with cinema-grade prompts. Routes to local models (WAN, LTX), FAL (Veo, Kling), or Ken Burns fallback
- Background Music — Mood-matched music generation (Stable Audio or MusicGen)
5. Talking Heads
The AI actor portrait + voiceover audio segments are combined into talking-head video clips using avatar models (Kling Avatar, OmniHuman, or SadTalker/Wav2Lip locally).
6. Lip-Sync
For models that don’t include built-in lip-sync, a separate lip-sync pass aligns mouth movements to audio (VEED, MuseTalk, LatentSync, or Wav2Lip).
7. Post-Production
- Audio Mix — Voiceover + background music mixed with configurable volume levels
- Transcription — Word-level transcription via Faster Whisper for subtitle timing
- Composition — Timeline assembly interleaving talking-head and b-roll segments
- Subtitles — Burned into video with proper timing
- Hook Overlay — Animated text overlay for the hook line (first 3 seconds)
- Effects — Optional video effects pipeline
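To show why word-level transcription matters for subtitle timing, the sketch below groups timed words into short subtitle cues. It is an illustration only: `group_words`, the `(text, start, end)` tuple layout, and the four-words-per-cue limit are assumptions, since the pipeline's actual subtitle grouping is not described here.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def group_words(words: List[Word], max_words: int = 4) -> List[Word]:
    """Merge consecutive timed words into cues of at most max_words words."""
    cues: List[Word] = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w[0] for w in chunk)
        # A cue starts with its first word and ends with its last word.
        cues.append((text, chunk[0][1], chunk[-1][2]))
    return cues


words = [
    ("Cats", 0.0, 0.3), ("always", 0.3, 0.7), ("land", 0.7, 1.0),
    ("on", 1.0, 1.1), ("their", 1.1, 1.4), ("feet", 1.4, 1.9),
]
cues = group_words(words)
# cues == [("Cats always land on", 0.0, 1.1), ("their feet", 1.1, 1.9)]
```

Because each cue inherits its start and end times from the words it contains, the burned-in text stays aligned with the voiceover.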
8. Output Validation
Before final delivery, a `video_output_gate` probes the rendered file and asserts:
- File is not truncated (minimum 10KB)
- Duration is valid (> 0 seconds)
- Resolution meets minimum (720x1280)
- Audio track is present
If any check fails, GateError is raised with detailed diagnostics. This catches broken renders automatically — no more discovering corrupt output after delivery.
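The four checks can be expressed as a small function over probe metadata. This is a sketch under stated assumptions, not the pipeline's implementation: the `GateError` class here is a local stand-in, the function mirrors only the documented checks, and the `probe` dictionary keys (`size_bytes`, `duration`, `width`, `height`, `has_audio`) are hypothetical names for values a media prober would report.

```python
from typing import Dict, List


class GateError(Exception):
    """Local stand-in for the pipeline's GateError; the real class may differ."""


MIN_BYTES = 10 * 1024            # "not truncated" threshold: 10KB
MIN_WIDTH, MIN_HEIGHT = 720, 1280


def video_output_gate(probe: Dict[str, object]) -> None:
    """Raise GateError listing every failed check; return None if all pass."""
    failures: List[str] = []
    if probe.get("size_bytes", 0) < MIN_BYTES:
        failures.append("file is truncated (under 10KB)")
    if probe.get("duration", 0.0) <= 0:
        failures.append("duration is not greater than 0 seconds")
    if probe.get("width", 0) < MIN_WIDTH or probe.get("height", 0) < MIN_HEIGHT:
        failures.append("resolution is below 720x1280")
    if not probe.get("has_audio", False):
        failures.append("audio track is missing")
    if failures:
        raise GateError("; ".join(failures))


good_probe = {"size_bytes": 4_000_000, "duration": 45.2,
              "width": 1080, "height": 1920, "has_audio": True}
video_output_gate(good_probe)  # passes silently
```

Collecting all failures before raising, rather than stopping at the first one, gives a single error message that covers every problem with the render.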
Output
The pipeline produces a single .mp4 file at 1080x1920 (9:16 vertical) with:
- Burned-in subtitles
- Hook text overlay
- Mixed audio (voice + music)
- All segments composited in timeline order
```json
{
  "final_video_path": "/tmp/fabric_final_abc123.mp4",
  "script": { "topic": "...", "segments": [...] },
  "voiceover_path": "/tmp/fabric_vo_xyz.mp3",
  "actor_image_path": "/tmp/fabric_img_def.png",
  "broll_path_0": "/tmp/fabric_broll_0.mp4",
  "broll_path_1": "/tmp/fabric_broll_1.mp4",
  "broll_path_2": "/tmp/fabric_broll_2.mp4"
}
```

Running Locally
The pipeline works fully offline with local models. No API keys required.
```shell
# Install local dependencies
pip install "mlx-video @ git+https://github.com/Blaizzy/mlx-video.git"  # Mac
pip install diffusers torch transformers accelerate  # Any platform
ollama pull qwen3:8b  # Script generation

# Run with local profile
fabric run video/ai-shorts \
  --input topic="The future of AI" \
  --input quality=local
```

See Local Video & Image Models for setup details.