
AI Shorts Pipeline

The AI Shorts pipeline generates production-ready short-form video (TikTok, Reels, Shorts) from a single topic string. It orchestrates script generation, AI actor creation, voiceover, b-roll video, background music, talking heads, lip-sync, subtitles, and final composition — all as a single workflow.

generate_script
generate_keyframes (opt-in)
                │
    ┌───────────┼───────────┬─────────────┐
    │           │           │             │
generate_   generate_   generate_     generate_
ai_actor    voiceover   broll (×3)    bgm
    │           │           │             │
    └───────────┼───────────┴─────────────┘
                │
merge_generation
generate_talking_heads
lipsync_talking_heads
mix_audio
transcribe_voiceover
compose_timeline
burn_subtitles
burn_hook_overlay
effects_pipeline
collect_final_output

The 6-way fork (actor, voiceover, b-roll ×3, music) runs in parallel for maximum throughput. With remote models on a fast connection, a 45-second video completes in 3-5 minutes.

# Minimal — topic only
fabric run global/ai-shorts --input topic="Why sleep is a superpower"
# Full control
fabric run global/ai-shorts \
--input topic="AI is replacing junior developers" \
--input mood="dramatic" \
--input platform="YouTube Shorts" \
--input duration_secs=60 \
--input quality=premium \
--input use_keyframe_grid=true
Parameter          Type    Default                             Description
topic              string  (required)                          The subject of the video
hook               string  ""                                  Specific hook line (auto-generated if empty)
mood               string  "high-energy and conversational"    Emotional tone for visuals and narration
platform           string  "TikTok"                            Target platform (affects pacing and framing)
duration_secs      int     45                                  Target video duration in seconds
presenter_look     string  "confident young creator..."        AI actor appearance description
visual_style       string  ""                                  Override visual aesthetic (e.g. "neon cyberpunk")
quality            string  ""                                  Quality preset: cheap, premium, ultra, local, local-power
use_keyframe_grid  bool    false                               Enable 2x2 grid keyframe generation
gender             string  auto-detected                       Voice gender for TTS (male or female)

Quality presets control which models are used at each pipeline stage:

Preset       Script        TTS         Avatar     B-Roll          Lip-sync  Music
cheap        Gemini Flash  Kokoro      Kling v2   Veo 3.1         FAL       Stable Audio
premium      Gemini Flash  ElevenLabs  Kling v2   Kling v2.5      VEED      Stable Audio
ultra        Gemini Flash  ElevenLabs  OmniHuman  Kling v3 (i2v)  built-in  Stable Audio
local        Qwen3 8B      Kokoro      Wav2Lip    WAN 1.3B        Wav2Lip   MusicGen
local-power  Qwen3 latest  Kokoro      Wav2Lip    WAN 1.3B        Wav2Lip   MusicGen
local-light  Gemma3 4B     Piper       skip       skip            skip      skip

Individual model keys can override any preset value. See Model Configuration for details.

An LLM generates a structured script with:

  • Hook text — the attention-grabbing opening line
  • Full narration — the complete voiceover script (110-160 words)
  • Segments — 5-7 alternating segments of type actor_talking or broll, each with timing, narration text, and a visual prompt

The script also generates a continuity brief — a text prefix encoding the video’s unified color palette, film stock, and atmospheric quality. This prefix is prepended to every downstream visual generation prompt. See Shot Design.
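For orientation, a generated script object might look like the following. Field names and values here are illustrative assumptions, not the pipeline's exact schema:

```python
# Illustrative script structure -- field names and values are
# assumptions, not the pipeline's exact schema.
script = {
    "topic": "Why sleep is a superpower",
    "hook": "You are sabotaging your brain every single night.",
    "narration": "...",  # full 110-160 word voiceover text
    "continuity_brief": "warm amber palette, 35mm film stock, soft haze",
    "segments": [
        {"type": "actor_talking", "start": 0.0, "end": 6.5,
         "text": "...", "visual_prompt": "..."},
        {"type": "broll", "start": 6.5, "end": 12.0,
         "text": "...", "visual_prompt": "slow push-in on a dark bedroom"},
    ],
}
```

The continuity brief is prepended to each segment's visual_prompt before generation, which is what keeps disparate clips looking like one film.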

When enabled, a 2x2 grid of keyframe images is generated from the b-roll segment descriptions. All 4 panels are generated in a single image, forcing visual consistency. The grid is cropped into individual keyframes that serve as reference images for image-to-video generation.
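The crop itself is simple arithmetic. A minimal sketch (the helper below is hypothetical, not part of the pipeline) computes the four crop boxes for a row-major 2x2 grid; each box can then be handed to any image library's crop call:

```python
def keyframe_grid_boxes(width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) crop boxes for a 2x2 grid,
    in row-major order: top-left, top-right, bottom-left, bottom-right.
    Hypothetical helper, not part of the pipeline."""
    hw, hh = width // 2, height // 2
    return [
        (0, 0, hw, hh),           # top-left
        (hw, 0, width, hh),       # top-right
        (0, hh, hw, height),      # bottom-left
        (hw, hh, width, height),  # bottom-right
    ]

boxes = keyframe_grid_boxes(1024, 1024)
```

Each cropped panel then serves as the reference frame for one image-to-video call.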

Six tasks run concurrently:

  • AI Actor — Generates a portrait image matching the presenter_look description via Imagen 4
  • Voiceover — Text-to-speech of the full narration (ElevenLabs, Kokoro, or Piper)
  • B-Roll (x3) — Up to 3 b-roll video clips generated with cinema-grade prompts. Routes to local models (WAN, LTX), FAL (Veo, Kling), or Ken Burns fallback
  • Background Music — Mood-matched music generation (Stable Audio or MusicGen)
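The fan-out can be sketched with asyncio; the stub coroutines below stand in for the real generation tasks, and the names are assumptions:

```python
import asyncio

# Stubs standing in for the real generation tasks (names are assumptions).
async def generate_ai_actor() -> str:
    return "actor.png"

async def generate_voiceover() -> str:
    return "voiceover.mp3"

async def generate_broll(i: int) -> str:
    return f"broll_{i}.mp4"

async def generate_bgm() -> str:
    return "bgm.mp3"

async def generation_fork() -> dict:
    # All six tasks start at once; the merge step proceeds only
    # after every branch has finished.
    actor, voice, b0, b1, b2, music = await asyncio.gather(
        generate_ai_actor(),
        generate_voiceover(),
        generate_broll(0), generate_broll(1), generate_broll(2),
        generate_bgm(),
    )
    return {"actor": actor, "voiceover": voice,
            "broll": [b0, b1, b2], "bgm": music}

results = asyncio.run(generation_fork())
```

Because the slowest branch gates the merge, total fork time is roughly max(actor, voiceover, b-roll, music) rather than their sum.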

The AI actor portrait and the voiceover audio segments are combined into talking-head video clips using avatar models (Kling Avatar, OmniHuman, or SadTalker/Wav2Lip locally).

For models that don’t include built-in lip-sync, a separate lip-sync pass aligns mouth movements to audio (VEED, MuseTalk, LatentSync, or Wav2Lip).

  • Audio Mix — Voiceover + background music mixed with configurable volume levels
  • Transcription — Word-level transcription via Faster Whisper for subtitle timing
  • Composition — Timeline assembly interleaving talking-head and b-roll segments
  • Subtitles — Burned into video with proper timing
  • Hook Overlay — Animated text overlay for the hook line (first 3 seconds)
  • Effects — Optional video effects pipeline
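The composition step's interleaving logic can be sketched as follows; the data shapes are assumptions, not the pipeline's internal types:

```python
def build_timeline(segments, talking_head_clips, broll_clips):
    """Walk the script in order, pairing each segment with the next
    rendered clip of the matching kind (data shapes are assumptions)."""
    timeline = []
    th, br = iter(talking_head_clips), iter(broll_clips)
    for seg in segments:
        clip = next(th) if seg["type"] == "actor_talking" else next(br)
        timeline.append({"clip": clip, "start": seg["start"], "end": seg["end"]})
    return timeline

segments = [
    {"type": "actor_talking", "start": 0.0, "end": 6.0},
    {"type": "broll", "start": 6.0, "end": 12.0},
    {"type": "actor_talking", "start": 12.0, "end": 18.0},
]
tl = build_timeline(segments, ["th0.mp4", "th1.mp4"], ["br0.mp4"])
```

The script's segment ordering fully determines the cut, so composition is a deterministic pass over already-rendered clips.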

The pipeline produces a single .mp4 file at 1080x1920 (9:16 vertical) with:

  • Burned-in subtitles
  • Hook text overlay
  • Mixed audio (voice + music)
  • All segments composited in timeline order
{
  "final_video_path": "/tmp/fabric_final_abc123.mp4",
  "script": { "topic": "...", "segments": [...] },
  "voiceover_path": "/tmp/fabric_vo_xyz.mp3",
  "actor_image_path": "/tmp/fabric_img_def.png",
  "broll_path_0": "/tmp/fabric_broll_0.mp4",
  "broll_path_1": "/tmp/fabric_broll_1.mp4",
  "broll_path_2": "/tmp/fabric_broll_2.mp4"
}
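Downstream tooling can pull the artifacts it needs straight out of this JSON. A minimal sketch, with a hypothetical helper name:

```python
import json

def collect_artifacts(raw: str) -> dict:
    """Parse the pipeline's JSON result and group the b-roll paths.
    Illustrative helper, not part of the pipeline."""
    out = json.loads(raw)
    return {
        "video": out["final_video_path"],
        "voiceover": out["voiceover_path"],
        "broll": [out[k] for k in sorted(out) if k.startswith("broll_path_")],
    }

example = ('{"final_video_path": "final.mp4", "voiceover_path": "vo.mp3", '
           '"broll_path_0": "b0.mp4", "broll_path_1": "b1.mp4"}')
artifacts = collect_artifacts(example)
```

Sorting the keys keeps the b-roll clips in segment order (broll_path_0, broll_path_1, ...).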

The pipeline works fully offline with local models. No API keys required.

# Install local dependencies
pip install "mlx-video @ git+https://github.com/Blaizzy/mlx-video.git" # Mac
pip install diffusers torch transformers accelerate # Any platform
ollama pull qwen3:8b # Script generation
# Run with local profile
fabric run global/ai-shorts \
--input topic="The future of AI" \
--input quality=local

See Local Video & Image Models for setup details.