Skip to content

Local Models

Fabric supports local AI model inference alongside remote providers (OpenAI, Anthropic, Gemini). No API keys required for local models.

Ollama runs models locally. Supports Qwen3, Llama, Mistral, DeepSeek, Gemma, Phi, and many more.

Terminal window
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
Terminal window
ollama pull qwen3:latest # Qwen3 (default)
ollama pull llama3.2:latest # Llama 3.2
ollama pull deepseek-r1:latest # DeepSeek R1
ollama pull mistral:latest # Mistral
ollama pull nomic-embed-text # Embeddings
Terminal window
ollama serve # Starts on http://localhost:11434
Terminal window
# In .env
OLLAMA_ENABLED=true
OLLAMA_URL=http://localhost:11434 # default, can be omitted
Terminal window
# List available providers (should show "ollama")
curl http://localhost:3001/v1/providers
# Execute with Ollama
curl -X POST http://localhost:3001/v1/providers/execute \
-H 'content-type: application/json' \
-d '{
"modality": "text",
"model": "qwen3:latest",
"input": {"prompt": "Explain quantum computing in one sentence"},
"params": {"temperature": 0.7}
}'
# Embeddings
curl -X POST http://localhost:3001/v1/providers/execute \
-H 'content-type: application/json' \
-d '{
"modality": "embedding",
"model": "nomic-embed-text",
"input": {"text": "Hello world"}
}'

For local audio transcription using whisper.cpp or faster-whisper.

Starts automatically with just infra-up. Uses faster-whisper-server with the large-v3 model.

Available at http://localhost:8080.

Terminal window
# Build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
# Download model
bash ./models/download-ggml-model.sh large-v3
# Start server
./server -m models/ggml-large-v3.bin --port 8080
Terminal window
# In .env
WHISPER_URL=http://localhost:8080
# WHISPER_MODEL=large-v3 # Optional model override
Terminal window
curl -X POST http://localhost:3001/v1/providers/execute \
-H 'content-type: application/json' \
-d '{
"modality": "audio",
"input": {"audio_url": "https://example.com/speech.wav"},
"params": {"language": "en"}
}'

Piper is a fast, local neural text-to-speech system. Useful for voiceover generation in video workflows.

Terminal window
docker run -d -p 5000:5000 rhasspy/piper:latest
Terminal window
# In .env
PIPER_URL=http://localhost:5000

Chatterbox provides local voice cloning — generate speech in a specific voice from a short audio sample.

Terminal window
# In .env
CHATTERBOX_URL=http://localhost:8001

Local Stable Diffusion image generation via ComfyUI workflows.

Terminal window
# In .env
COMFYUI_ENABLED=true
COMFYUI_URL=http://localhost:8188

Run a GGUF model directly inside Fabric without a separate server. Requires the llama-cpp feature.

Terminal window
# In .env
LLAMA_CPP_MODEL=/path/to/qwen3-8b-q5.gguf

Any server that implements the OpenAI chat completions API works with the OpenAI provider:

  • vLLM: OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:8000/v1
  • llama.cpp server: OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:8080/v1
  • LocalAI: OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:8080/v1
  • LM Studio: OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:1234/v1

Set OPENAI_API_KEY to any non-empty string (the local server ignores it).

For multiple custom endpoints, use fabric.toml:

[[providers.custom_openai_endpoints]]
name = "vllm"
url = "http://localhost:8000/v1"
api_key = "no-key"
model = "mistral-7b"
tier = "basic"
[[providers.custom_openai_endpoints]]
name = "lmstudio"
url = "http://localhost:1234/v1"
api_key = "no-key"
model = "llama-3.2"
tier = "basic"

Set the server-wide routing strategy with FABRIC_ROUTING_STRATEGY:

StrategyBehavior
cheapest_qualifiedCheapest model that meets requirements (default)
local_firstPrefer local/free providers, fall back to remote
fastestLowest latency
best_qualityHighest quality regardless of cost
balancedBalance of cost, quality, and speed
Terminal window
# In .env — prefer local models, fall back to remote when needed
FABRIC_ROUTING_STRATEGY=local_first

To disable all remote providers and run entirely on local models:

Terminal window
FABRIC_LOCAL_ONLY=true

This blocks requests to OpenAI, Anthropic, Gemini, FAL, and any other remote provider — even if their API keys are set. Video and other modalities without a dedicated local provider fall back to echo stubs.

Use local-only mode when running end-to-end tests to guarantee zero remote API spend:

Terminal window
# Pull minimum required models
ollama pull qwen3:8b
ollama pull nomic-embed-text
# Start Fabric in local-only mode
FABRIC_LOCAL_ONLY=true FABRIC_ROUTING_STRATEGY=local_first fabric serve

Text routes to Ollama, images to ComfyUI (if enabled), and video/audio to echo stubs. Workflows complete without any remote API calls.

When multiple providers support the same modality, Fabric routes by tier then cost. Registration order for matching:

  1. OpenAI (if OPENAI_API_KEY set)
  2. Anthropic (if ANTHROPIC_API_KEY set)
  3. Google Gemini (if GOOGLE_API_KEY set)
  4. FAL (if FAL_API_KEY set)
  5. Ollama (if OLLAMA_ENABLED or OLLAMA_URL set)
  6. Whisper (if WHISPER_URL set)
  7. Piper (if PIPER_URL set)
  8. Chatterbox (if CHATTERBOX_URL set)
  9. ComfyUI (if COMFYUI_ENABLED set)
  10. Echo providers (always — fallback for testing)

To force a specific provider, include "model": "qwen3:latest" in the request — the router matches the provider that advertises that model.

Local models have zero API cost. The /v1/providers/estimate endpoint returns $0.00 for Ollama, Whisper, Piper, ComfyUI, and all other local providers.