Local Models

Fabric supports local AI model inference alongside remote providers (OpenAI, Anthropic, Gemini). No API keys required for local models.

Ollama (LLMs + Embeddings)

Ollama runs models locally. Supports Qwen3, Llama, Mistral, DeepSeek, Gemma, Phi, and many more.

Install

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Pull Models

ollama pull qwen3:latest        # Qwen3 (default)
ollama pull llama3.2:latest     # Llama 3.2
ollama pull deepseek-r1:latest  # DeepSeek R1
ollama pull mistral:latest      # Mistral
ollama pull nomic-embed-text    # Embeddings

Run

ollama serve  # Starts on http://localhost:11434

Configure Fabric

# In .env
OLLAMA_ENABLED=true
OLLAMA_URL=http://localhost:11434   # default, can be omitted

Usage via API

# List available providers (should show "ollama")
curl http://localhost:3001/v1/providers

# Execute with Ollama
curl -X POST http://localhost:3001/v1/providers/execute \
  -H 'content-type: application/json' \
  -d '{
    "modality": "text",
    "model": "qwen3:latest",
    "input": {"prompt": "Explain quantum computing in one sentence"},
    "params": {"temperature": 0.7}
  }'

# Embeddings
curl -X POST http://localhost:3001/v1/providers/execute \
  -H 'content-type: application/json' \
  -d '{
    "modality": "embedding",
    "model": "nomic-embed-text",
    "input": {"text": "Hello world"}
  }'

Whisper (Audio Transcription)

For local audio transcription using whisper.cpp or faster-whisper.

Via Docker (Recommended)

Starts automatically with just infra-up. Uses faster-whisper-server with the large-v3 model.

Available at http://localhost:8080.

Manual Install (whisper.cpp)

# Build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
# Download model
bash ./models/download-ggml-model.sh large-v3
# Start server
./server -m models/ggml-large-v3.bin --port 8080

Configure Fabric

# In .env
WHISPER_URL=http://localhost:8080
# WHISPER_MODEL=large-v3           # Optional model override

Usage

curl -X POST http://localhost:3001/v1/providers/execute \
  -H 'content-type: application/json' \
  -d '{
    "modality": "audio",
    "input": {"audio_url": "https://example.com/speech.wav"},
    "params": {"language": "en"}
  }'

Piper (Text-to-Speech)

Piper is a fast, local neural text-to-speech system. Useful for voiceover generation in video workflows.

Run with Docker

docker run -d -p 5000:5000 rhasspy/piper:latest

Configure Fabric

# In .env
PIPER_URL=http://localhost:5000

Chatterbox (Voice Cloning)

Chatterbox provides local voice cloning — generate speech in a specific voice from a short audio sample.

Configure Fabric

# In .env
CHATTERBOX_URL=http://localhost:8001

ComfyUI (Image Generation)

Local Stable Diffusion image generation via ComfyUI workflows.

Configure Fabric

# In .env
COMFYUI_ENABLED=true
COMFYUI_URL=http://localhost:8188

llama.cpp (Embedded LLM)

Run a GGUF model directly inside Fabric without a separate server. Requires the llama-cpp feature.

# In .env
LLAMA_CPP_MODEL=/path/to/qwen3-8b-q5.gguf

OpenAI-Compatible Servers

Any server that implements the OpenAI chat completions API works with the OpenAI provider:

vLLM: OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:8000/v1
llama.cpp server: OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:8080/v1
LocalAI: OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:8080/v1
LM Studio: OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:1234/v1

Set OPENAI_API_KEY to any non-empty string (the local server ignores it).

For multiple custom endpoints, use fabric.toml:

[[providers.custom_openai_endpoints]]
name = "vllm"
url = "http://localhost:8000/v1"
api_key = "no-key"
model = "mistral-7b"
tier = "basic"

[[providers.custom_openai_endpoints]]
name = "lmstudio"
url = "http://localhost:1234/v1"
api_key = "no-key"
model = "llama-3.2"
tier = "basic"

Routing Strategy

Set the server-wide routing strategy with FABRIC_ROUTING_STRATEGY:

Strategy	Behavior
`cheapest_qualified`	Cheapest model that meets requirements (default)
`local_first`	Prefer local/free providers, fall back to remote
`fastest`	Lowest latency
`best_quality`	Highest quality regardless of cost
`balanced`	Balance of cost, quality, and speed

# In .env — prefer local models, fall back to remote when needed
FABRIC_ROUTING_STRATEGY=local_first

Local-Only Mode

To disable all remote providers and run entirely on local models:

FABRIC_LOCAL_ONLY=true

This blocks requests to OpenAI, Anthropic, Gemini, FAL, and any other remote provider — even if their API keys are set. Video and other modalities without a dedicated local provider fall back to echo stubs.

E2E Testing

Use local-only mode when running end-to-end tests to guarantee zero remote API spend:

# Pull minimum required models
ollama pull qwen3:8b
ollama pull nomic-embed-text

# Start Fabric in local-only mode
FABRIC_LOCAL_ONLY=true FABRIC_ROUTING_STRATEGY=local_first fabric serve

Text routes to Ollama, images to ComfyUI (if enabled), and video/audio to echo stubs. Workflows complete without any remote API calls.

Provider Priority

When multiple providers support the same modality, Fabric routes by tier then cost. Registration order for matching:

OpenAI (if OPENAI_API_KEY set)
Anthropic (if ANTHROPIC_API_KEY set)
Google Gemini (if GOOGLE_API_KEY set)
FAL (if FAL_API_KEY set)
Ollama (if OLLAMA_ENABLED or OLLAMA_URL set)
Whisper (if WHISPER_URL set)
Piper (if PIPER_URL set)
Chatterbox (if CHATTERBOX_URL set)
ComfyUI (if COMFYUI_ENABLED set)
Echo providers (always — fallback for testing)

To force a specific provider, include "model": "qwen3:latest" in the request — the router matches the provider that advertises that model.

Cost

Local models have zero API cost. The /v1/providers/estimate endpoint returns $0.00 for Ollama, Whisper, Piper, ComfyUI, and all other local providers.