Observability & Debugging
Fabric uses the Rust tracing ecosystem for structured logging. Every significant control-plane action — from inbound webhook triggers through subprocess lifecycle — emits structured log events with correlation fields (run_id, org_id, workflow, sink_id).
Log Levels
| Level | What you see |
|---|---|
| `error` | Workflow failures, subprocess crashes, bootstrap failures |
| `warn` | Verification failures, missing sinks, stale recovery triggers, pool pressure |
| `info` | Trigger accepted, subprocess spawned/exited, workflow completed, startup banners |
| `debug` | Sink path lookups, heartbeat ticks, internal state transitions |
| `trace` | Detailed SQL queries, HTTP request/response bodies (very verbose) |
Configuring Log Output
Set the RUST_LOG environment variable before starting the server:
```sh
# Standard operation (recommended default)
RUST_LOG=info fabric serve

# Debug the trigger → execution pipeline
RUST_LOG=info,fabric_api::routes::sinks=debug,fabric_events::sink=debug,fabric_engine=debug fabric serve

# Debug only the subprocess lifecycle
RUST_LOG=info,fabric_engine::client::execution=debug fabric serve

# Everything (very verbose — development only)
RUST_LOG=debug fabric serve
```

The filter syntax follows tracing-subscriber `EnvFilter` rules:

- `info` — global info level
- `fabric_engine=debug` — debug for a specific crate
- `fabric_api::routes::sinks=trace` — trace for a specific module
Debugging Workflow Runs
The Trigger → Execution Pipeline
When a client (e.g., Socialsite) submits a run via webhook, the request flows through:
```
POST /v1/sinks/trigger/:org_id/*path
  → SinkRegistry.trigger_http()      — path lookup in memory index
  → HttpSink.handle_trigger()        — verification + payload merge
  → EngineSinkExecutor.execute()     — bridge to engine
  → FabricWorkflowClient.submit()    — DB insert + subprocess spawn
  → Python DurableEngine.run()       — actual workflow execution
```

Each stage logs at info level on success and warn/error on failure.
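Because every stage carries the same correlation fields, a single run can be followed end to end by filtering on its run_id. A minimal sketch; the log lines piped in below are invented samples, not real Fabric output — in practice you would grep your captured server logs:

```sh
# Follow one run across pipeline stages by its run_id correlation field.
# The heredoc stands in for actual server output.
grep 'run_id=abc123' <<'EOF'
INFO fabric_api::routes::sinks: Incoming HTTP sink trigger run_id=abc123 org_id=org_1
INFO fabric_engine::client::execution: Subprocess spawned run_id=abc123
INFO fabric_engine::client::execution: Subprocess exited run_id=abc123 exit_ok=true
INFO fabric_api::routes::sinks: Incoming HTTP sink trigger run_id=zzz999 org_id=org_1
EOF
```

The same filter works for org_id, workflow, or sink_id, since all of them are emitted as structured fields.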
Common Failure Scenarios
Webhook never arrives
Symptom: No “Incoming HTTP sink trigger” log.
Check:
- Is the server actually running on the expected port? Look for the startup line: `Fabric HTTP on <addr>`
- Did the port fall back? Look for `PORT ... IS IN USE — fabric fell back to port ...`
- Is the client hitting the correct URL (`/v1/sinks/trigger/{org_id}/{path}`)?
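To rule out the client entirely, you can probe the trigger endpoint by hand. A sketch — the port (3001, taken from the health-check example below), org id, and path are placeholders for your deployment:

```sh
# Probe the trigger endpoint directly. A status of 000 means nothing
# answered at all (server down or wrong port); any 4xx means the server
# is up and the problem is auth, path, or payload.
# org_123 and hooks/socialsite are placeholder values.
code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 \
  -X POST -H 'Content-Type: application/json' -d '{}' \
  'http://localhost:3001/v1/sinks/trigger/org_123/hooks/socialsite' || true)
echo "status: $code"
```

If this returns a 2xx but the client's webhooks still never show up in the logs, the problem is on the client's side of the network path.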
Sink path not found
Symptom: No sink found — path not in index warning with registered_count=0.
Cause: Sink registrations weren’t loaded from the database on startup.
Check:
- Look for: `Loaded sink registrations from database` at startup (with count)
- If missing or `n=0`: no registrations exist in the `event_sink_registrations` table for this org
- Re-register the sink via `POST /v1/organizations/:org_id/sinks`
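You can also look at the registrations table directly. A sketch, assuming a working psql configuration; only the table name (`event_sink_registrations`) comes from this page, so `SELECT *` is used rather than guessing at column names:

```sh
# Inspect sink registrations straight from the database; falls back
# gracefully when psql or the database is unreachable.
psql -c "SELECT * FROM event_sink_registrations LIMIT 20;" \
  || echo "psql unavailable or database unreachable"
```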
Sink path mismatch
Symptom: No sink found — path not in index with registered_count > 0 and listed paths don’t match.
Cause: The path in the webhook URL doesn’t match the registered sink path exactly.
Check: The log prints all registered paths — compare against the incoming path. Common issues:
- Trailing slash mismatch (`/hooks/socialsite` vs `/hooks/socialsite/`)
- Case sensitivity (`/hooks/Socialsite` vs `/hooks/socialsite`)
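Both mismatches can be caught with a quick local comparison before re-registering anything. A sketch in plain shell, using the example paths above (the `norm` helper is illustrative, not part of Fabric):

```sh
# Normalize two paths (lowercase, strip trailing slashes) and compare.
# If they match only after normalization, the client URL has a
# slash or case problem.
incoming='/hooks/Socialsite/'
registered='/hooks/socialsite'
norm() { printf '%s' "$1" | tr 'A-Z' 'a-z' | sed 's:/*$::'; }
if [ "$(norm "$incoming")" = "$(norm "$registered")" ]; then
  echo "paths match after normalization: fix slash/case in the client URL"
else
  echo "paths differ even after normalization"
fi
```

Note that Fabric's index itself matches exactly; the normalization here is only a diagnostic to spot which kind of mismatch you have.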
Subprocess never exits (run stuck in “running”)
Symptom: “Subprocess spawned” log appears, heartbeat ticks continue, but no “Subprocess exited” log.
Cause: The Python process is blocked — usually on a network call (API timeout, unreachable local model).
Diagnose:
```sh
# Find the stuck Python process
ps aux | grep DurableEngine

# Check what it's doing
lsof -p <PID>   # open files/sockets

# Check the run's updated_at in the database
psql -c "SELECT id, status, updated_at, error FROM fabric_workflow_runs WHERE status = 'running' ORDER BY updated_at DESC LIMIT 5;"
```

Fix: Kill the stuck process — the heartbeat will stop, and after 120s the stale recovery sweep will mark it failed.
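To see which runs the stale recovery sweep would currently pick up, you can query for heartbeats older than the 120s threshold. A sketch using the table and columns from the query above, assuming a working psql setup:

```sh
# Runs still marked 'running' whose last update predates the 120s
# stale-recovery threshold; falls back gracefully with no database.
psql -c "SELECT id, status, updated_at FROM fabric_workflow_runs
         WHERE status = 'running'
           AND updated_at < now() - interval '120 seconds';" \
  || echo "psql unavailable or database unreachable"
```

An empty result with a visibly stuck process usually means the heartbeat is still ticking, i.e. the subprocess is alive but blocked.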
Subprocess exits immediately with error
Symptom: “Subprocess exited” with exit_ok=false shortly after spawn.
Check: The error is captured from stderr and stored in the run record:
```sh
psql -c "SELECT id, workflow_name, error FROM fabric_workflow_runs WHERE status = 'failed' ORDER BY created_at DESC LIMIT 5;"
```

Common causes:

- Python not found (no `.venv/bin/python` and no `python3` on PATH)
- Missing Python dependency (`ModuleNotFoundError`)
- Workflow not in registry (`Workflow not found: <name>`)
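The first cause can be checked from the directory Fabric runs in. A small sketch that mirrors the lookup order described above (`.venv/bin/python` first, then `python3` on PATH); the echo messages are illustrative:

```sh
# Mirror the documented interpreter lookup order.
if [ -x .venv/bin/python ]; then
  echo "interpreter: .venv/bin/python"
elif command -v python3 >/dev/null 2>&1; then
  echo "interpreter: $(command -v python3)"
else
  echo "interpreter: none found; subprocess spawn will fail"
fi
```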
Subprocess budget exhausted
Symptom: Run immediately fails with “Subprocess connection budget exhausted”.
Cause: Too many workflows running concurrently. Default limit is max_concurrent_workflows (20) × subprocess_pool_per (2) = 40 active DB connections for subprocesses.
Check:
```sh
psql -c "SELECT count(*) FROM fabric_workflow_runs WHERE status = 'running';"
```

Fix: Wait for running workflows to complete, or increase `DATABASE_MAX_CONCURRENT_WORKFLOWS`.
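The budget arithmetic above can be written out directly; the two values (20 and 2) are the defaults stated in this section:

```sh
# Default subprocess connection budget:
#   max_concurrent_workflows × subprocess_pool_per
max_concurrent_workflows=20   # default; raise via DATABASE_MAX_CONCURRENT_WORKFLOWS
subprocess_pool_per=2         # DB connections reserved per subprocess
echo "subprocess connection budget: $((max_concurrent_workflows * subprocess_pool_per))"
```

Remember that raising the workflow limit multiplies the database connection demand, so check your Postgres `max_connections` headroom first.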
Monitoring in Production
Section titled “Monitoring in Production”Key Metrics
Fabric exposes Prometheus metrics at the /metrics endpoint:
| Metric | What it tells you |
|---|---|
| `fabric_workflow_runs_total` | Total runs by status |
| `fabric_pool_bg_acquires_total` | Background pool usage |
| `fabric_pool_bg_exhausted_total` | Pool budget exhaustion events |
| `fabric_circuit_trips_total` | Circuit breaker activations |
| `fabric_cfs_throttled_usec` | CPU throttling (containers) |
Health Check
Section titled “Health Check”curl http://localhost:3001/healthPool Health Logger
When the pool is degraded, Fabric logs a health snapshot every 60 seconds:
```
PoolService health (60s window) total_bg_acquires=42 failed_bg_acquires=3 bg_budget_exhausted=1 subprocess_budget_exhausted=0 circuit_trips=0 pressure=Elevated
```

Pressure levels: Normal → Elevated → Critical. At Critical, background tasks (maintenance, cron, asset reaping) are paused to preserve connections for HTTP handlers and active workflows.
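For ad-hoc alerting, the pressure level can be scraped out of that snapshot line. A minimal sketch using the example line from above; the `key=value` layout is what the snapshot shows, but verify it against your own logs:

```sh
# Pull the pressure level out of a PoolService health line.
line='PoolService health (60s window) total_bg_acquires=42 failed_bg_acquires=3 bg_budget_exhausted=1 subprocess_budget_exhausted=0 circuit_trips=0 pressure=Elevated'
printf '%s\n' "$line" | sed -n 's/.*pressure=\([A-Za-z]*\).*/\1/p'
# prints: Elevated
```

Piping live logs through the same sed and alerting on `Critical` gives a crude early warning before background tasks get paused.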
Event Stream for Real-Time Debugging
For real-time visibility into run progress without watching server logs, use the SSE event stream:
```sh
# Watch all events for an org
curl -N -H "Authorization: Bearer fab_xxx" \
  "http://localhost:3001/v1/events/stream?organization_id=<org_id>"

# Watch a specific run
curl -N -H "Authorization: Bearer fab_xxx" \
  "http://localhost:3001/v1/events/stream?run_id=<run_id>"
```

Events include: `workflow.run.started`, `workflow.run.completed`, `workflow.run.failed`, `workflow.node.started`, `workflow.node.completed`.
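To watch for failures only, the stream can be piped through a filter. The SSE framing in the heredoc below is a simulated sample (this page doesn't show the exact wire format); against a live server you would pipe `curl -N` on the stream endpoint into the same grep:

```sh
# Keep only failed-run events; the heredoc stands in for a live stream.
grep 'workflow.run.failed' <<'EOF'
event: workflow.run.started
event: workflow.node.completed
event: workflow.run.failed
EOF
```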