Local Models

Fabric supports local AI model inference alongside remote providers (OpenAI, Anthropic). No API keys required for local models.

Ollama runs models locally and supports Qwen3, Llama, Mistral, DeepSeek, Gemma, Phi, and many more.

Install Ollama:
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
Pull models:
ollama pull qwen3:latest # Qwen3 (default)
ollama pull llama3.2:latest # Llama 3.2
ollama pull deepseek-r1:latest # DeepSeek R1
ollama pull mistral:latest # Mistral
ollama pull nomic-embed-text # Embeddings
Start the server:
ollama serve # Starts on http://localhost:11434
Configure Fabric:
# In .env
OLLAMA_ENABLED=true
# Or point to a remote Ollama instance:
OLLAMA_URL=http://gpu-server:11434
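As a minimal sketch of how these two variables might combine (an assumption about Fabric's internals, consistent with the registration rule below: Ollama is active when either `OLLAMA_ENABLED` or `OLLAMA_URL` is set, with the local default URL as fallback):

```python
DEFAULT_OLLAMA_URL = "http://localhost:11434"

def ollama_settings(env: dict[str, str]) -> tuple[bool, str]:
    """Illustrative, not Fabric's exact logic: derive (enabled, base_url)
    from the OLLAMA_* environment variables."""
    url = env.get("OLLAMA_URL", DEFAULT_OLLAMA_URL)
    enabled = env.get("OLLAMA_ENABLED", "").lower() == "true" or "OLLAMA_URL" in env
    return enabled, url

print(ollama_settings({"OLLAMA_ENABLED": "true"}))                 # (True, 'http://localhost:11434')
print(ollama_settings({"OLLAMA_URL": "http://gpu-server:11434"}))  # (True, 'http://gpu-server:11434')
```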
Test it:
# List available providers (should show "ollama")
curl http://localhost:3001/v1/providers
# Execute with Ollama
curl -X POST http://localhost:3001/v1/providers/execute \
-H 'content-type: application/json' \
-d '{
"modality": "text",
"model": "qwen3:latest",
"input": {"prompt": "Explain quantum computing in one sentence"},
"params": {"temperature": 0.7}
}'
# Embeddings
curl -X POST http://localhost:3001/v1/providers/execute \
-H 'content-type: application/json' \
-d '{
"modality": "embedding",
"model": "nomic-embed-text",
"input": {"text": "Hello world"}
}'

Fabric supports local audio transcription via whisper.cpp or faster-whisper.

Set up whisper.cpp:
# Build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
# Download model
bash ./models/download-ggml-model.sh large-v3
# Start server
./server -m models/ggml-large-v3.bin --port 8080
Configure Fabric:
# In .env
WHISPER_URL=http://localhost:8080
Transcribe audio:
curl -X POST http://localhost:3001/v1/providers/execute \
-H 'content-type: application/json' \
-d '{
"modality": "audio",
"input": {"audio_url": "https://example.com/speech.wav"},
"params": {"language": "en"}
}'

Any server that implements the OpenAI chat completions API works with the OpenAI provider:

  • vLLM: OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:8000/v1
  • llama.cpp server: OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:8080/v1
  • LocalAI: OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:8080/v1

Set OPENAI_API_KEY to any non-empty string (the local server ignores it).
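The shared wire format is the OpenAI chat completions endpoint: `POST {base_url}/chat/completions` with a bearer token and a `messages` array. A stdlib-only Python sketch that builds such a request (the model name and port are placeholders for whatever your local server loads):

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completions request for a local server."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            # Local servers ignore the key, but the header must be present.
            "Authorization": "Bearer dummy",
        },
        method="POST",
    )

# Example: a llama.cpp server on port 8080; send with urllib.request.urlopen(req)
req = chat_request("http://localhost:8080/v1", "qwen3", "Say hello")
```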

When multiple providers support the same modality, Fabric routes to the first match. Registration order:

  1. OpenAI (if OPENAI_API_KEY set)
  2. Anthropic (if ANTHROPIC_API_KEY set)
  3. Ollama (if OLLAMA_ENABLED or OLLAMA_URL set)
  4. Whisper (if WHISPER_URL set)
  5. Echo providers (always, fallback for testing)

To force a specific provider, include "model": "qwen3:latest" in the request — the router will match the provider that advertises that model.

Local models have zero API cost. The /v1/providers/estimate endpoint returns $0.00 for Ollama and Whisper.