
Local Video & Image Models

Fabric supports fully local video and image generation using open-source models. No API keys, no cloud costs, no data leaving your machine.

When generating video or images locally, Fabric tries backends in order until one succeeds:

  1. mlx-video — Apple Silicon native via MLX framework (Mac only). Fastest on M-series chips.
  2. diffusers — HuggingFace pipelines with CUDA or MPS acceleration (cross-platform).
  3. ComfyUI — If FABRIC_COMFYUI_URL is set, delegates to a ComfyUI server.
  4. Ken Burns fallback — Generates a still image and applies a zoom animation via FFmpeg.
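The selection logic above can be sketched as a simple fallback loop. The `order_backends` and `generate_with_fallback` helpers below are illustrative stand-ins, not Fabric's actual internals:

```python
# Hypothetical sketch of the backend fallback chain described above.
# Helper names are illustrative; Fabric's real internals may differ.
import os

def generate_with_fallback(prompt, backends):
    """Try each backend in order; return the first successful result."""
    errors = {}
    for name, fn in backends:
        try:
            return fn(prompt)
        except Exception as exc:  # a real implementation would narrow this
            errors[name] = exc
    raise RuntimeError(f"All backends failed: {errors}")

def order_backends(mlx, diffusers, comfyui, ken_burns):
    """Build the priority list; ComfyUI joins only if its URL is set."""
    chain = [("mlx-video", mlx), ("diffusers", diffusers)]
    if os.environ.get("FABRIC_COMFYUI_URL"):
        chain.append(("comfyui", comfyui))
    chain.append(("ken-burns", ken_burns))
    return chain
```

Because the Ken Burns fallback only needs FFmpeg and a still image, the chain always terminates with a usable result even on machines with no GPU acceleration.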
Video models

| Model ID | VRAM | FPS | Default Frames | Resolution | Backend |
| --- | --- | --- | --- | --- | --- |
| wan:1.3b | 8 GB | 16 | 33 (~2s) | 480x832 | mlx-video, diffusers |
| wan:14b | 24 GB | 16 | 81 (~5s) | 480x832 | mlx-video, diffusers |
| ltx-video | 8 GB | 24 | 97 (~4s) | 768x512 | mlx-video, diffusers |
| cogvideox:2b | 6 GB | 8 | 49 (~6s) | 480x720 | diffusers only |
| cogvideox:5b | 12 GB | 8 | 49 (~6s) | 480x720 | diffusers only |
Image models

| Model ID | VRAM | Steps | Backend |
| --- | --- | --- | --- |
| sdxl-turbo | 6 GB | 4 | diffusers |
| flux-schnell | 8 GB | 4 | diffusers |
| sd3.5-medium | 8 GB | 28 | diffusers |
Avatar models

| Model ID | VRAM | Type | Built-in Lip-sync |
| --- | --- | --- | --- |
| sadtalker | 8 GB | Avatar | Yes |
| echomimic | 16 GB | Avatar | Yes |
| hallo | 24 GB | Avatar | Yes |
Lip-sync models

| Model ID | VRAM | Type |
| --- | --- | --- |
| wav2lip | 4 GB | Lip-sync |
| latentsync | 8 GB | Lip-sync |
| musetalk | 16 GB | Lip-sync |
```sh
# MLX-video — native Apple Silicon, recommended
pip install "mlx-video @ git+https://github.com/Blaizzy/mlx-video.git"

# Models are downloaded automatically on first use
# Cached at: ~/.cache/mlx-models/
```
```sh
# Core dependencies
pip install diffusers torch transformers accelerate sentencepiece

# Models are downloaded from HuggingFace on first use
```
```sh
# Point to a running ComfyUI server
export FABRIC_COMFYUI_URL=http://localhost:8188
```
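Before relying on the ComfyUI backend, it can help to verify the server is actually reachable. This is a standalone sketch using only the standard library; `comfyui_reachable` is a hypothetical helper (not part of Fabric), and the `/system_stats` endpoint is assumed from ComfyUI's HTTP API:

```python
# Illustrative check that the server named in FABRIC_COMFYUI_URL responds.
# Assumes ComfyUI's GET /system_stats endpoint; not part of Fabric's API.
import os
import urllib.request

def comfyui_reachable(timeout=2.0):
    """Return True if FABRIC_COMFYUI_URL is set and the server answers."""
    url = os.environ.get("FABRIC_COMFYUI_URL")
    if not url:
        return False
    try:
        with urllib.request.urlopen(url.rstrip("/") + "/system_stats",
                                    timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```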
```python
from fabric_workflow_sdk._local_video import (
    generate_video,
    generate_image,
    generate_talking_head,
    lipsync_video,
    is_available,
)

# Check if any local backend is available
if is_available():
    # Generate video
    video_path = await generate_video(
        input_dict,
        "A cinematic ocean wave crashing on rocks",
        model="wan:1.3b",
        duration=5,
    )

    # Generate image
    image_path = await generate_image(
        input_dict,
        "A sunset over mountains",
        model="sdxl-turbo",
        aspect_ratio="9:16",
    )

    # Generate talking head from portrait + audio
    video_path = await generate_talking_head(
        image_path="portrait.png",
        audio_path="voiceover.mp3",
        model="sadtalker",
    )

    # Lip-sync existing video to new audio
    synced_path = await lipsync_video(
        video_path="talking.mp4",
        audio_path="new_audio.mp3",
        model="wav2lip",
    )
```

Set `quality=local` to use local models for the entire AI Shorts pipeline:

```sh
fabric run global/ai-shorts \
  --input topic="The future of AI" \
  --input quality=local
```
| Profile | Video | Image | TTS | Avatar |
| --- | --- | --- | --- | --- |
| local | wan:1.3b | sdxl-turbo | Kokoro | Wav2Lip |
| local-power | wan:1.3b | flux-schnell | Kokoro | Wav2Lip |
| local-light | skip | sdxl-turbo | Piper | skip |
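As a rough mental model, each quality profile amounts to a lookup from profile name to model choices. The dict below restates the table for illustration; it is not Fabric's actual configuration format:

```python
# Illustrative mapping of quality profiles to model choices.
# None marks a pipeline step the profile skips entirely.
LOCAL_PROFILES = {
    "local":       {"video": "wan:1.3b", "image": "sdxl-turbo",
                    "tts": "kokoro", "avatar": "wav2lip"},
    "local-power": {"video": "wan:1.3b", "image": "flux-schnell",
                    "tts": "kokoro", "avatar": "wav2lip"},
    "local-light": {"video": None, "image": "sdxl-turbo",
                    "tts": "piper", "avatar": None},
}

def models_for(profile):
    """Look up the model set for a quality profile."""
    return LOCAL_PROFILES[profile]
```

`local-light` trades video generation for speed: it renders stills with sdxl-turbo and skips both the video and avatar steps.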

On first use, MLX models are downloaded from HuggingFace and converted to MLX format. This is a one-time operation:

```
Downloading and converting Wan-AI/Wan2.1-T2V-1.3B to MLX format (first time only)...
Converted T5 encoder: ~/.cache/mlx-models/Wan-AI--Wan2.1-T2V-1.3B/t5_encoder.safetensors
Converted VAE: ~/.cache/mlx-models/Wan-AI--Wan2.1-T2V-1.3B/vae.safetensors
Model ready at: ~/.cache/mlx-models/Wan-AI--Wan2.1-T2V-1.3B
```

Converted weights are cached at ~/.cache/mlx-models/ and reused across sessions.

Loaded diffusers pipelines are cached in memory to avoid reloading weights between generations. The cache is automatically cleaned on process exit.
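A minimal sketch of that caching pattern, with `atexit` handling the cleanup on process exit. The `loader` callback is a hypothetical stand-in for the actual pipeline loading; Fabric's real cache is internal:

```python
# Minimal sketch of an in-memory pipeline cache cleared at process exit.
# The loader callback stands in for the real (expensive) pipeline load.
import atexit

_PIPELINE_CACHE = {}

def get_pipeline(model_id, loader):
    """Return a cached pipeline, loading weights only on first request."""
    if model_id not in _PIPELINE_CACHE:
        _PIPELINE_CACHE[model_id] = loader(model_id)
    return _PIPELINE_CACHE[model_id]

@atexit.register
def _clear_cache():
    # Dropping the references lets GPU memory be reclaimed at shutdown.
    _PIPELINE_CACHE.clear()
```

Keying the cache by model ID means switching models within one session pays the load cost once per model, while repeated generations with the same model reuse the resident weights.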

Image generation supports these aspect ratios:

| Aspect Ratio | Resolution | Use Case |
| --- | --- | --- |
| 9:16 | 1080x1920 | Vertical social (TikTok, Reels) |
| 16:9 | 1920x1080 | Horizontal (YouTube) |
| 1:1 / square | 1024x1024 | Square (Instagram) |
| 3:4 | 768x1024 | Portrait |
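A sketch of how an `aspect_ratio` string could map to these resolutions; the lookup below is illustrative, not Fabric's actual implementation:

```python
# Illustrative aspect-ratio lookup matching the table above.
RESOLUTIONS = {
    "9:16": (1080, 1920),    # vertical social (TikTok, Reels)
    "16:9": (1920, 1080),    # horizontal (YouTube)
    "1:1": (1024, 1024),     # square (Instagram)
    "square": (1024, 1024),  # alias for 1:1
    "3:4": (768, 1024),      # portrait
}

def resolution_for(aspect_ratio):
    """Return (width, height) for a supported aspect ratio string."""
    try:
        return RESOLUTIONS[aspect_ratio]
    except KeyError:
        raise ValueError(f"Unsupported aspect ratio: {aspect_ratio!r}")
```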