
AI Shorts Pipeline

The AI Shorts pipeline generates production-ready short-form video (TikTok, Reels, Shorts) from a single topic string. It orchestrates script generation, AI actor creation, voiceover, b-roll video, background music, talking heads, lip-sync, subtitles, and final composition — all as a single workflow.

generate_script
generate_keyframes (opt-in)
                │
    ┌───────────┼───────────┬─────────────┐
    │           │           │             │
generate_   generate_   generate_     generate_
ai_actor    voiceover   broll (×3)    bgm
    │           │           │             │
    └───────────┼───────────┴─────────────┘
                │
merge_generation
generate_talking_heads
lipsync_talking_heads
mix_audio
transcribe_voiceover
compose_timeline
burn_subtitles
burn_hook_overlay
effects_pipeline
collect_final_output

The 6-way fork (actor, voiceover, b-roll ×3, music) runs in parallel for maximum throughput. With remote models on a fast connection, a 45-second video completes in 3-5 minutes.

# Minimal — topic only
fabric run global/ai-shorts --input topic="Why sleep is a superpower"
# Full control
fabric run global/ai-shorts \
--input topic="AI is replacing junior developers" \
--input mood="dramatic" \
--input platform="YouTube Shorts" \
--input duration_secs=60 \
--input quality=premium \
--input use_keyframe_grid=true
Parameter          Type    Default                             Description
topic              string  (required)                          The subject of the video
hook               string  ""                                  Specific hook line (auto-generated if empty)
mood               string  "high-energy and conversational"    Emotional tone for visuals and narration
platform           string  "TikTok"                            Target platform (affects pacing and framing)
duration_secs      int     45                                  Target video duration in seconds
presenter_look     string  "confident young creator..."        AI actor appearance description
visual_style       string  ""                                  Override visual aesthetic (e.g. "neon cyberpunk")
quality            string  ""                                  Quality preset: cheap, premium, ultra, local, local-power
use_keyframe_grid  bool    false                               Enable 2x2 grid keyframe generation
gender             string  auto-detected                       Voice gender for TTS (male or female)

Quality presets control which models are used at each pipeline stage:

Preset       Script        TTS         Avatar     B-Roll          Lip-sync  Music
cheap        Gemini Flash  Kokoro      Kling v2   Veo 3.1         FAL       Stable Audio
premium      Gemini Flash  ElevenLabs  Kling v2   Kling v2.5      VEED      Stable Audio
ultra        Gemini Flash  ElevenLabs  OmniHuman  Kling v3 (i2v)  built-in  Stable Audio
local        Qwen3 8B      Kokoro      Wav2Lip    WAN 1.3B        Wav2Lip   MusicGen
local-power  Qwen3 latest  Kokoro      Wav2Lip    WAN 1.3B        Wav2Lip   MusicGen
local-light  Gemma3 4B     Piper       skip       skip            skip      skip

Individual model keys can override any preset value. See Model Configuration for details.

An LLM generates a structured script with:

  • Hook text — the attention-grabbing opening line
  • Full narration — the complete voiceover script (110-160 words)
  • Segments — 5-7 alternating segments of type actor_talking or broll, each with timing, narration text, and a visual prompt

The script also generates a continuity brief — a text prefix encoding the video’s unified color palette, film stock, and atmospheric quality. This prefix is prepended to every downstream visual generation prompt. See Shot Design.
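For orientation, a generated script object might look like the following. Field names and values here are illustrative assumptions, not the pipeline's exact schema:

```python
# Illustrative script structure -- field names and values are
# assumptions, not the pipeline's exact schema.
script = {
    "topic": "Why sleep is a superpower",
    "hook": "You are sabotaging your brain every single night.",
    "narration": "...",  # full 110-160 word voiceover text
    "continuity_brief": "warm amber palette, 35mm film stock, soft haze",
    "segments": [
        {"type": "actor_talking", "start": 0.0, "end": 6.5,
         "text": "...", "visual_prompt": "..."},
        {"type": "broll", "start": 6.5, "end": 12.0,
         "text": "...", "visual_prompt": "slow push-in on a dark bedroom"},
    ],
}
```

The continuity brief is prepended to each segment's visual_prompt before generation, which is what keeps disparate clips looking like one film.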

When enabled, a 2x2 grid of keyframe images is generated from the b-roll segment descriptions. All 4 panels are generated in a single image, forcing visual consistency. The grid is cropped into individual keyframes that serve as reference images for image-to-video generation.
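The crop itself is simple arithmetic. A minimal sketch (the helper below is hypothetical, not part of the pipeline) computes the four crop boxes for a row-major 2x2 grid; each box can then be handed to any image library's crop call:

```python
def keyframe_grid_boxes(width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) crop boxes for a 2x2 grid,
    in row-major order: top-left, top-right, bottom-left, bottom-right.
    Hypothetical helper, not part of the pipeline."""
    hw, hh = width // 2, height // 2
    return [
        (0, 0, hw, hh),           # top-left
        (hw, 0, width, hh),       # top-right
        (0, hh, hw, height),      # bottom-left
        (hw, hh, width, height),  # bottom-right
    ]

boxes = keyframe_grid_boxes(1024, 1024)
```

Each cropped panel then serves as the reference frame for one image-to-video call.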

Six tasks run concurrently:

  • AI Actor — Generates a portrait image matching the presenter_look description via Imagen 4
  • Voiceover — Text-to-speech of the full narration (ElevenLabs, Kokoro, or Piper)
  • B-Roll (x3) — Up to 3 b-roll video clips generated with cinema-grade prompts. Routes to local models (WAN, LTX), FAL (Veo, Kling), or Ken Burns fallback
  • Background Music — Mood-matched music generation (Stable Audio or MusicGen)
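The fan-out can be sketched with asyncio; the stub coroutines below stand in for the real generation tasks, and the names are assumptions:

```python
import asyncio

# Stubs standing in for the real generation tasks (names are assumptions).
async def generate_ai_actor() -> str:
    return "actor.png"

async def generate_voiceover() -> str:
    return "voiceover.mp3"

async def generate_broll(i: int) -> str:
    return f"broll_{i}.mp4"

async def generate_bgm() -> str:
    return "bgm.mp3"

async def generation_fork() -> dict:
    # All six tasks start at once; the merge step proceeds only
    # after every branch has finished.
    actor, voice, b0, b1, b2, music = await asyncio.gather(
        generate_ai_actor(),
        generate_voiceover(),
        generate_broll(0), generate_broll(1), generate_broll(2),
        generate_bgm(),
    )
    return {"actor": actor, "voiceover": voice,
            "broll": [b0, b1, b2], "bgm": music}

results = asyncio.run(generation_fork())
```

Because the slowest branch gates the merge, total fork time is roughly max(actor, voiceover, b-roll, music) rather than their sum.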

The AI actor portrait and the voiceover audio segments are combined into talking-head video clips using avatar models (Kling Avatar, OmniHuman, or SadTalker/Wav2Lip locally).

For models that don’t include built-in lip-sync, a separate lip-sync pass aligns mouth movements to audio (VEED, MuseTalk, LatentSync, or Wav2Lip).

  • Audio Mix — Voiceover + background music mixed with configurable volume levels
  • Transcription — Word-level transcription via Faster Whisper for subtitle timing
  • Composition — Timeline assembly interleaving talking-head and b-roll segments
  • Subtitles — Burned into video with proper timing
  • Hook Overlay — Animated text overlay for the hook line (first 3 seconds)
  • Effects — Optional video effects pipeline
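The composition step's interleaving logic can be sketched as follows; the data shapes are assumptions, not the pipeline's internal types:

```python
def build_timeline(segments, talking_head_clips, broll_clips):
    """Walk the script in order, pairing each segment with the next
    rendered clip of the matching kind (data shapes are assumptions)."""
    timeline = []
    th, br = iter(talking_head_clips), iter(broll_clips)
    for seg in segments:
        clip = next(th) if seg["type"] == "actor_talking" else next(br)
        timeline.append({"clip": clip, "start": seg["start"], "end": seg["end"]})
    return timeline

segments = [
    {"type": "actor_talking", "start": 0.0, "end": 6.0},
    {"type": "broll", "start": 6.0, "end": 12.0},
    {"type": "actor_talking", "start": 12.0, "end": 18.0},
]
tl = build_timeline(segments, ["th0.mp4", "th1.mp4"], ["br0.mp4"])
```

The script's segment ordering fully determines the cut, so composition is a deterministic pass over already-rendered clips.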

The pipeline produces a single .mp4 file at 1080x1920 (9:16 vertical) with:

  • Burned-in subtitles
  • Hook text overlay
  • Mixed audio (voice + music)
  • All segments composited in timeline order
{
  "final_video_path": "/tmp/fabric_final_abc123.mp4",
  "script": { "topic": "...", "segments": [...] },
  "voiceover_path": "/tmp/fabric_vo_xyz.mp3",
  "actor_image_path": "/tmp/fabric_img_def.png",
  "broll_path_0": "/tmp/fabric_broll_0.mp4",
  "broll_path_1": "/tmp/fabric_broll_1.mp4",
  "broll_path_2": "/tmp/fabric_broll_2.mp4"
}
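Downstream tooling can pull the artifacts it needs straight out of this JSON. A minimal sketch, with a hypothetical helper name:

```python
import json

def collect_artifacts(raw: str) -> dict:
    """Parse the pipeline's JSON result and group the b-roll paths.
    Illustrative helper, not part of the pipeline."""
    out = json.loads(raw)
    return {
        "video": out["final_video_path"],
        "voiceover": out["voiceover_path"],
        "broll": [out[k] for k in sorted(out) if k.startswith("broll_path_")],
    }

example = ('{"final_video_path": "final.mp4", "voiceover_path": "vo.mp3", '
           '"broll_path_0": "b0.mp4", "broll_path_1": "b1.mp4"}')
artifacts = collect_artifacts(example)
```

Sorting the keys keeps the b-roll clips in segment order (broll_path_0, broll_path_1, ...).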

The pipeline works fully offline with local models. No API keys required.

# Install local dependencies
pip install "mlx-video @ git+https://github.com/Blaizzy/mlx-video.git" # Mac
pip install diffusers torch transformers accelerate # Any platform
ollama pull qwen3:8b # Script generation
# Run with local profile
fabric run global/ai-shorts \
--input topic="The future of AI" \
--input quality=local

See Local Video & Image Models for setup details.