AI Shorts Pipeline

The AI Shorts pipeline generates production-ready short-form video (TikTok, Reels, Shorts) from a single topic string. It orchestrates script generation, AI actor creation, voiceover, b-roll video, background music, talking heads, lip-sync, subtitles, and final composition — all as a single workflow.

Quick Start

# Minimal — topic only (faceless, 45s, educational)
fabric run video/ai-shorts --input topic="Why cats always land on their feet"

# With a talking-head presenter
fabric run video/ai-shorts \
  --input topic="AI in healthcare" \
  --input actor_image_url="https://example.com/presenter.png" \
  --input voice=narrator

# Full control
fabric run video/ai-shorts \
  --input topic="5 morning habits that changed my life" \
  --input style=listicle \
  --input duration_seconds=30 \
  --input audience=professionals \
  --input quality=premium

from fabric_platform import FabricClient

fabric = FabricClient()

# Faceless short
run = fabric.ai_shorts("Why sleep is a superpower")

# Talking-head with custom voice
run = fabric.ai_shorts(
    "AI is replacing junior developers",
    actor_image_url="https://example.com/presenter.png",
    voice="energetic-male",
    style="promotional",
    duration_seconds=60,
)
print(run["output"]["title"])
print(run["output"]["video_asset_id"])

import { FabricClient } from "@fabric-platform/sdk";
const fabric = new FabricClient();

const run = await fabric.workflows.runs.submit("video/ai-shorts", {
  input: {
    topic: "Why sleep is a superpower",
    actor_image_url: "https://example.com/presenter.png",
    voice: "warm",
    style: "storytelling",
    duration_seconds: 45,
  },
});
const result = await fabric.workflows.runs.waitForRun(run.id);

Pipeline Architecture

                        generate_script
                              │
                        generate_keyframes  (opt-in)
                              │
                  multiview_character_prep  (premium/ultra)
                              │
                    ┌─────────┼─────────────────────┐
                    │         │         │            │
              generate_   generate_  generate_   generate_
              ai_actor  voiceover  broll (×3)     bgm
                    │         │         │            │
                    └─────────┼─────────────────────┘
                              │
                      merge_generation
                              │
                    generate_talking_heads
                              │
                    lipsync_talking_heads
                              │
                         mix_audio
                              │
                    transcribe_voiceover
                              │
                      compose_timeline
                              │
                      burn_subtitles
                              │
                    burn_hook_overlay
                              │
                       effects_pipeline
                              │
                    collect_final_output

The 6-way fork (actor, voiceover, b-roll x3, music) runs in parallel for maximum throughput. On a fast connection with remote models, a 45-second video completes in 3-5 minutes.
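
The fork-and-merge shape above can be sketched with `asyncio.gather`. This is an illustration only: `fake_stage` is a hypothetical stand-in that sleeps instead of calling a model, and the real pipeline's scheduler may work differently.

```python
import asyncio


async def fake_stage(name: str, seconds: float) -> tuple[str, str]:
    """Hypothetical stand-in for a generation task; sleeps instead of calling a model."""
    await asyncio.sleep(seconds)
    return name, f"{name}.out"


async def parallel_generation() -> dict[str, str]:
    # Six branches: actor, voiceover, three b-roll clips, background music.
    tasks = [
        fake_stage("generate_ai_actor", 0.02),
        fake_stage("generate_voiceover", 0.01),
        *(fake_stage(f"generate_broll_{i}", 0.03) for i in range(3)),
        fake_stage("generate_bgm", 0.01),
    ]
    # gather() runs the branches concurrently and returns results in task order,
    # which is where a merge step would collect them.
    results = await asyncio.gather(*tasks)
    return dict(results)


merged = asyncio.run(parallel_generation())
```

Because the branches overlap, wall-clock time is roughly that of the slowest branch, not the sum of all six.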

Input Parameters

Simple inputs

Parameter	Type	Default	Description
`topic`	`string`	required	The subject of the video
`actor_image_url`	`string`	`null`	Public URL to a portrait image. When set → talking-head with lip sync. When omitted → faceless b-roll video.
`voice`	`string`	`"alloy"`	TTS voice style (`narrator`, `warm`, `energetic-male`, `energetic-female`, `deep`, `friendly`)
`duration_seconds`	`int`	`45`	Target duration (15-120 seconds)
`style`	`string`	`"educational"`	`educational`, `promotional`, `storytelling`, `listicle` — maps to `mood` internally
`audience`	`string`	`"general"`	Target audience: `general`, `teens`, `professionals`, `technical`, `kids`
`platforms`	`string[]`	`[]`	Platform export slugs: `tiktok`, `instagram`, `youtube_shorts`, `youtube`, `linkedin`. When set, output includes resized variants with platform-specific metadata.
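
As a rough sketch of how the simple inputs constrain each other, the dataclass below mirrors the table's defaults and allowed values. `ShortsInput` and its `mode` property are hypothetical names for illustration. Whether the real pipeline rejects or clamps out-of-range values is not stated here, so this sketch assumes rejection.

```python
from dataclasses import dataclass, field
from typing import List, Optional

STYLES = {"educational", "promotional", "storytelling", "listicle"}
AUDIENCES = {"general", "teens", "professionals", "technical", "kids"}
PLATFORMS = {"tiktok", "instagram", "youtube_shorts", "youtube", "linkedin"}


@dataclass
class ShortsInput:
    """Hypothetical container mirroring the simple inputs table."""

    topic: str
    actor_image_url: Optional[str] = None
    voice: str = "alloy"
    duration_seconds: int = 45
    style: str = "educational"
    audience: str = "general"
    platforms: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.topic.strip():
            raise ValueError("topic is required")
        if not 15 <= self.duration_seconds <= 120:
            raise ValueError("duration_seconds must be within 15-120")
        if self.style not in STYLES:
            raise ValueError(f"unknown style: {self.style}")
        if self.audience not in AUDIENCES:
            raise ValueError(f"unknown audience: {self.audience}")
        unknown = set(self.platforms) - PLATFORMS
        if unknown:
            raise ValueError(f"unknown platforms: {sorted(unknown)}")

    @property
    def mode(self) -> str:
        # An actor image selects the talking-head path; otherwise faceless b-roll.
        return "talking_head" if self.actor_image_url else "faceless"
```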

Advanced overrides

These bypass the simple inputs above for full control:

Parameter	Type	Default	Description
`mood`	`string`	derived from `style`	Direct mood/tone override (e.g. “high-energy and conversational”)
`hook`	`string`	`""`	Specific hook line (auto-generated if empty)
`platform`	`string`	`"TikTok"`	Target platform (affects pacing and framing)
`presenter_look`	`string`	`""`	AI actor appearance description
`visual_style`	`string`	`""`	Override visual aesthetic (e.g. “neon cyberpunk”)
`quality`	`string`	`""`	Quality preset: `budget`, `standard`, `premium`, `ultra`, `local`, `local-power`
`use_keyframe_grid`	`bool`	`false`	Enable 2x2 grid keyframe generation

Quality Presets

Quality presets control which models are used at each pipeline stage:

Preset	Script	TTS	Avatar	B-Roll	Lip-sync	Music
`cheap`	Gemini Flash	Kokoro	Kling v2	Veo 3.1	FAL	Stable Audio
`premium`	Gemini Flash	ElevenLabs	Kling v2	Kling v2.5	VEED	Stable Audio
`ultra`	Gemini Flash	ElevenLabs	OmniHuman	Kling v3 (i2v)	built-in	Stable Audio
`local`	Qwen3 8B	Kokoro	Wav2Lip	WAN 1.3B	Wav2Lip	MusicGen
`local-power`	Qwen3 latest	Kokoro	Wav2Lip	WAN 1.3B	Wav2Lip	MusicGen
`local-light`	Gemma3 4B	Piper	skip	skip	skip	skip

Individual model keys can override any preset value. See Model Configuration for details.
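
One way to picture preset resolution is a dictionary merge in which per-stage overrides win over the preset. The stage key names (`script`, `tts`, `avatar`, `broll`, `lipsync`, `music`) and the `resolve_models` helper are assumptions for illustration. The model names come from the table above, and the actual configuration keys may differ.

```python
# Two rows of the preset table, keyed by hypothetical stage names.
PRESETS = {
    "premium": {
        "script": "Gemini Flash",
        "tts": "ElevenLabs",
        "avatar": "Kling v2",
        "broll": "Kling v2.5",
        "lipsync": "VEED",
        "music": "Stable Audio",
    },
    "local": {
        "script": "Qwen3 8B",
        "tts": "Kokoro",
        "avatar": "Wav2Lip",
        "broll": "WAN 1.3B",
        "lipsync": "Wav2Lip",
        "music": "MusicGen",
    },
}


def resolve_models(preset: str, **overrides: str) -> dict[str, str]:
    """Return the preset's models with individual stage overrides applied."""
    if preset not in PRESETS:
        raise KeyError(f"unknown preset: {preset}")
    unknown = set(overrides) - set(PRESETS[preset])
    if unknown:
        raise ValueError(f"unknown stage keys: {sorted(unknown)}")
    # Later entries win, so overrides replace preset values without mutating PRESETS.
    return {**PRESETS[preset], **overrides}
```

For example, `resolve_models("premium", tts="Kokoro")` keeps every premium model except the TTS stage.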

Pipeline Stages

1. Script Generation

An LLM generates a structured script with:

Hook text — the attention-grabbing opening line
Full narration — the complete voiceover script (110-160 words)
Segments — 5-7 alternating segments of type actor_talking or broll, each with timing, narration text, and a visual prompt

The script stage also produces a continuity brief: a text prefix encoding the video’s unified color palette, film stock, and atmospheric quality. This prefix is prepended to every downstream visual generation prompt. See Shot Design.
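
A minimal sketch of the script structure and of how the continuity brief is prepended to a prompt. The `Segment` fields and the `with_continuity` helper are hypothetical names chosen for illustration; the actual script schema may differ.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical shape of one script segment."""

    kind: str           # "actor_talking" or "broll"
    start: float        # seconds from the start of the video
    end: float
    narration: str
    visual_prompt: str


def with_continuity(brief: str, segment: Segment) -> str:
    # Prepend the shared brief so every visual prompt carries the same look.
    return f"{brief.strip()} {segment.visual_prompt.strip()}"


brief = "Warm teal-and-amber palette, 35mm film grain, soft morning haze."
seg = Segment("broll", 3.0, 8.0, "Cats twist in mid-air.", "Slow-motion cat mid-fall")
prompt = with_continuity(brief, seg)
```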

2. Keyframe Grid (Optional)

When enabled, a 2x2 grid of keyframe images is generated from the b-roll segment descriptions. All 4 panels are generated in a single image, forcing visual consistency. The grid is cropped into individual keyframes that serve as reference images for image-to-video generation.
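
The cropping step amounts to splitting the grid image into four equal cells. The helper below is a sketch, not the pipeline's code: it computes boxes in the `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts. The name `grid_crop_boxes` is hypothetical.

```python
def grid_crop_boxes(
    width: int, height: int, rows: int = 2, cols: int = 2
) -> list[tuple[int, int, int, int]]:
    """Return (left, upper, right, lower) boxes for each grid cell, row by row."""
    cell_w, cell_h = width // cols, height // rows
    return [
        (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
        for r in range(rows)
        for c in range(cols)
    ]


boxes = grid_crop_boxes(2048, 2048)
```

With Pillow installed, each keyframe would then be `grid_image.crop(box)` for `box` in `boxes`.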

3. Multi-View Character Prep (Premium/Ultra)

At premium and ultra quality tiers, when an actor image is available, the pipeline generates 4 perspective views (front, 3/4, profile, back) of the character. These multi-angle references are passed to all downstream Seedance b-roll calls to keep the character consistent across camera angles and scenes.

This step is automatic — no configuration needed beyond selecting a premium/ultra preset and providing an actor image. See VFX Compositing for details and direct SDK usage.

4. Parallel Generation

Six tasks run concurrently (the b-roll entry below accounts for three of them):

AI Actor — Generates a portrait image matching the presenter_look description via Imagen 4
Voiceover — Text-to-speech of the full narration (ElevenLabs, Kokoro, or Piper)
B-Roll (x3) — Up to 3 b-roll video clips generated with cinema-grade prompts. Routes to local models (WAN, LTX) or FAL-hosted models (Veo, Kling), with a Ken Burns fallback
Background Music — Mood-matched music generation (Stable Audio or MusicGen)

5. Talking Heads

The AI actor portrait and the voiceover audio segments are combined into talking-head video clips using avatar models (Kling Avatar, OmniHuman, or SadTalker/Wav2Lip locally).

6. Lip-Sync

For models that don’t include built-in lip-sync, a separate lip-sync pass aligns mouth movements to audio (VEED, MuseTalk, LatentSync, or Wav2Lip).

7. Post-Production

Audio Mix — Voiceover + background music mixed with configurable volume levels
Transcription — Word-level transcription via Faster Whisper for subtitle timing
Composition — Timeline assembly interleaving talking-head and b-roll segments
Subtitles — Burned into video with proper timing
Hook Overlay — Animated text overlay for the hook line (first 3 seconds)
Effects — Optional video effects pipeline
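
To illustrate how word-level timestamps become subtitle timing, the sketch below groups words into short cues and formats them as SRT. The `(text, start, end)` tuple shape and the four-words-per-cue limit are assumptions; the pipeline's actual subtitle format and grouping rules are not specified here.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: list[tuple[str, float, float]], max_words: int = 4) -> str:
    """Group (text, start, end) words into numbered SRT cues."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i : i + max_words]
        # A cue spans from its first word's start to its last word's end.
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(word for word, _, _ in chunk)
        cues.append(
            f"{len(cues) + 1}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        )
    return "\n\n".join(cues) + "\n"


words = [
    ("Why", 0.0, 0.2), ("cats", 0.2, 0.5), ("always", 0.5, 0.9),
    ("land", 0.9, 1.2), ("upright", 1.2, 1.8),
]
srt = words_to_srt(words)
```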

8. Output Validation

Before final delivery, a video_output_gate probes the rendered file and asserts:

File is not truncated (minimum 10KB)
Duration is valid (> 0 seconds)
Resolution meets minimum (720x1280)
Audio track is present

If any check fails, a GateError is raised with detailed diagnostics, so broken renders are caught before delivery rather than after.
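
The four checks can be sketched as a function over probe metadata. Everything below is illustrative: the `probe` dictionary keys, the `check_video_output` helper, and this local `GateError` class are stand-ins, not the pipeline's actual gate. In practice the metadata would come from a tool such as ffprobe.

```python
class GateError(Exception):
    """Local stand-in for the error raised when an output check fails."""


MIN_BYTES = 10 * 1024          # "minimum 10KB"
MIN_WIDTH, MIN_HEIGHT = 720, 1280


def check_video_output(probe: dict) -> None:
    """Raise GateError listing every failed check; return None if all pass."""
    failures = []
    if probe.get("size_bytes", 0) < MIN_BYTES:
        failures.append("file is truncated (under 10KB)")
    if probe.get("duration", 0) <= 0:
        failures.append("duration is not positive")
    if probe.get("width", 0) < MIN_WIDTH or probe.get("height", 0) < MIN_HEIGHT:
        failures.append("resolution is below 720x1280")
    if not probe.get("has_audio", False):
        failures.append("audio track is missing")
    if failures:
        raise GateError("; ".join(failures))


good = {"size_bytes": 5_000_000, "duration": 44.8, "width": 1080,
        "height": 1920, "has_audio": True}
check_video_output(good)  # passes silently
```

Collecting all failures before raising gives one diagnostic message per render, not one per retry.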

Output

The pipeline produces a single .mp4 file at 1080x1920 (9:16 vertical) with:

Burned-in subtitles
Hook text overlay
Mixed audio (voice + music)
All segments composited in timeline order

{
  "final_video_path": "/tmp/fabric_final_abc123.mp4",
  "script": { "topic": "...", "segments": [...] },
  "voiceover_path": "/tmp/fabric_vo_xyz.mp3",
  "actor_image_path": "/tmp/fabric_img_def.png",
  "broll_path_0": "/tmp/fabric_broll_0.mp4",
  "broll_path_1": "/tmp/fabric_broll_1.mp4",
  "broll_path_2": "/tmp/fabric_broll_2.mp4"
}
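
Since the b-roll clips appear as numbered keys, a consumer might collect them in index order. The `broll_paths` helper is a hypothetical convenience, written against the output shape shown above.

```python
import re


def broll_paths(output: dict) -> list[str]:
    """Return the b-roll clip paths from a run output, ordered by numeric index."""
    pattern = re.compile(r"^broll_path_(\d+)$")
    indexed = []
    for key, value in output.items():
        match = pattern.match(key)
        if match:
            indexed.append((int(match.group(1)), value))
    # Sort numerically so that index 10 follows index 9, not index 1.
    return [path for _, path in sorted(indexed)]


output = {
    "final_video_path": "/tmp/fabric_final_abc123.mp4",
    "broll_path_1": "/tmp/fabric_broll_1.mp4",
    "broll_path_0": "/tmp/fabric_broll_0.mp4",
    "broll_path_2": "/tmp/fabric_broll_2.mp4",
}
clips = broll_paths(output)
```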

Running Locally

With local models, the pipeline runs fully offline once the dependencies and model weights are installed. No API keys are required.

# Install local dependencies
pip install "mlx-video @ git+https://github.com/Blaizzy/mlx-video.git"  # Mac
pip install diffusers torch transformers accelerate  # Any platform
ollama pull qwen3:8b  # Script generation

# Run with local profile
fabric run video/ai-shorts \
  --input topic="The future of AI" \
  --input quality=local

See Local Video & Image Models for setup details.