Local Models
Fabric supports local AI model inference alongside remote providers (OpenAI, Anthropic, Gemini). No API keys required for local models.
Ollama (LLMs + Embeddings)
Section titled “Ollama (LLMs + Embeddings)”Ollama runs models locally. Supports Qwen3, Llama, Mistral, DeepSeek, Gemma, Phi, and many more.
Install
Section titled “Install”# macOSbrew install ollama
# Linuxcurl -fsSL https://ollama.com/install.sh | shPull Models
Section titled “Pull Models”ollama pull qwen3:latest # Qwen3 (default)ollama pull llama3.2:latest # Llama 3.2ollama pull deepseek-r1:latest # DeepSeek R1ollama pull mistral:latest # Mistralollama pull nomic-embed-text # Embeddingsollama serve # Starts on http://localhost:11434Configure Fabric
Section titled “Configure Fabric”# In .envOLLAMA_ENABLED=trueOLLAMA_URL=http://localhost:11434 # default, can be omittedUsage via API
Section titled “Usage via API”# List available providers (should show "ollama")curl http://localhost:3001/v1/providers
# Execute with Ollamacurl -X POST http://localhost:3001/v1/providers/execute \ -H 'content-type: application/json' \ -d '{ "modality": "text", "model": "qwen3:latest", "input": {"prompt": "Explain quantum computing in one sentence"}, "params": {"temperature": 0.7} }'
# Embeddingscurl -X POST http://localhost:3001/v1/providers/execute \ -H 'content-type: application/json' \ -d '{ "modality": "embedding", "model": "nomic-embed-text", "input": {"text": "Hello world"} }'Whisper (Audio Transcription)
Section titled “Whisper (Audio Transcription)”For local audio transcription using whisper.cpp or faster-whisper.
Via Docker (Recommended)
Section titled “Via Docker (Recommended)”Starts automatically with just infra-up. Uses faster-whisper-server with the large-v3 model.
Available at http://localhost:8080.
Manual Install (whisper.cpp)
Section titled “Manual Install (whisper.cpp)”# Build whisper.cppgit clone https://github.com/ggerganov/whisper.cppcd whisper.cpp && make# Download modelbash ./models/download-ggml-model.sh large-v3# Start server./server -m models/ggml-large-v3.bin --port 8080Configure Fabric
Section titled “Configure Fabric”# In .envWHISPER_URL=http://localhost:8080# WHISPER_MODEL=large-v3 # Optional model overridecurl -X POST http://localhost:3001/v1/providers/execute \ -H 'content-type: application/json' \ -d '{ "modality": "audio", "input": {"audio_url": "https://example.com/speech.wav"}, "params": {"language": "en"} }'Piper (Text-to-Speech)
Section titled “Piper (Text-to-Speech)”Piper is a fast, local neural text-to-speech system. Useful for voiceover generation in video workflows.
Run with Docker
Section titled “Run with Docker”docker run -d -p 5000:5000 rhasspy/piper:latestConfigure Fabric
Section titled “Configure Fabric”# In .envPIPER_URL=http://localhost:5000Chatterbox (Voice Cloning)
Section titled “Chatterbox (Voice Cloning)”Chatterbox provides local voice cloning — generate speech in a specific voice from a short audio sample.
Configure Fabric
Section titled “Configure Fabric”# In .envCHATTERBOX_URL=http://localhost:8001ComfyUI (Image Generation)
Section titled “ComfyUI (Image Generation)”Local Stable Diffusion image generation via ComfyUI workflows.
Configure Fabric
Section titled “Configure Fabric”# In .envCOMFYUI_ENABLED=trueCOMFYUI_URL=http://localhost:8188llama.cpp (Embedded LLM)
Section titled “llama.cpp (Embedded LLM)”Run a GGUF model directly inside Fabric without a separate server. Requires the llama-cpp feature.
# In .envLLAMA_CPP_MODEL=/path/to/qwen3-8b-q5.ggufOpenAI-Compatible Servers
Section titled “OpenAI-Compatible Servers”Any server that implements the OpenAI chat completions API works with the OpenAI provider:
- vLLM:
OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:8000/v1 - llama.cpp server:
OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:8080/v1 - LocalAI:
OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:8080/v1 - LM Studio:
OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:1234/v1
Set OPENAI_API_KEY to any non-empty string (the local server ignores it).
For multiple custom endpoints, use fabric.toml:
[[providers.custom_openai_endpoints]]name = "vllm"url = "http://localhost:8000/v1"api_key = "no-key"model = "mistral-7b"tier = "basic"
[[providers.custom_openai_endpoints]]name = "lmstudio"url = "http://localhost:1234/v1"api_key = "no-key"model = "llama-3.2"tier = "basic"Routing Strategy
Section titled “Routing Strategy”Set the server-wide routing strategy with FABRIC_ROUTING_STRATEGY:
| Strategy | Behavior |
|---|---|
cheapest_qualified | Cheapest model that meets requirements (default) |
local_first | Prefer local/free providers, fall back to remote |
fastest | Lowest latency |
best_quality | Highest quality regardless of cost |
balanced | Balance of cost, quality, and speed |
# In .env — prefer local models, fall back to remote when neededFABRIC_ROUTING_STRATEGY=local_firstLocal-Only Mode
Section titled “Local-Only Mode”To disable all remote providers and run entirely on local models:
FABRIC_LOCAL_ONLY=trueThis blocks requests to OpenAI, Anthropic, Gemini, FAL, and any other remote provider — even if their API keys are set. Video and other modalities without a dedicated local provider fall back to echo stubs.
E2E Testing
Section titled “E2E Testing”Use local-only mode when running end-to-end tests to guarantee zero remote API spend:
# Pull minimum required modelsollama pull qwen3:8bollama pull nomic-embed-text
# Start Fabric in local-only modeFABRIC_LOCAL_ONLY=true FABRIC_ROUTING_STRATEGY=local_first fabric serveText routes to Ollama, images to ComfyUI (if enabled), and video/audio to echo stubs. Workflows complete without any remote API calls.
Provider Priority
Section titled “Provider Priority”When multiple providers support the same modality, Fabric routes by tier then cost. Registration order for matching:
- OpenAI (if
OPENAI_API_KEYset) - Anthropic (if
ANTHROPIC_API_KEYset) - Google Gemini (if
GOOGLE_API_KEYset) - FAL (if
FAL_API_KEYset) - Ollama (if
OLLAMA_ENABLEDorOLLAMA_URLset) - Whisper (if
WHISPER_URLset) - Piper (if
PIPER_URLset) - Chatterbox (if
CHATTERBOX_URLset) - ComfyUI (if
COMFYUI_ENABLEDset) - Echo providers (always — fallback for testing)
To force a specific provider, include "model": "qwen3:latest" in the request — the router matches the provider that advertises that model.
Local models have zero API cost. The /v1/providers/estimate endpoint returns $0.00 for Ollama, Whisper, Piper, ComfyUI, and all other local providers.