Observability & Debugging

Fabric uses the Rust tracing ecosystem for structured logging. Every significant control-plane action — from inbound webhook triggers through the subprocess lifecycle — emits structured log events with correlation fields (run_id, org_id, workflow, sink_id).
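
For illustration, the shape of such an event in the tracing API looks like the sketch below; the function and message text are hypothetical, and only the field names come from this page:

```rust
use tracing::info;

// Hypothetical sketch: correlation fields are attached as structured fields
// on the event, not interpolated into the message string.
fn log_trigger_accepted(run_id: &str, org_id: &str, workflow: &str, sink_id: &str) {
    info!(run_id, org_id, workflow, sink_id, "Trigger accepted");
}
```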

| Level | What you see |
| ----- | ------------ |
| error | Workflow failures, subprocess crashes, bootstrap failures |
| warn  | Verification failures, missing sinks, stale recovery triggers, pool pressure |
| info  | Trigger accepted, subprocess spawned/exited, workflow completed, startup banners |
| debug | Sink path lookups, heartbeat ticks, internal state transitions |
| trace | Detailed SQL queries, HTTP request/response bodies (very verbose) |

Set the RUST_LOG environment variable before starting the server:

# Standard operation (recommended default)
RUST_LOG=info fabric serve
# Debug the trigger → execution pipeline
RUST_LOG=info,fabric_api::routes::sinks=debug,fabric_events::sink=debug,fabric_engine=debug fabric serve
# Debug only the subprocess lifecycle
RUST_LOG=info,fabric_engine::client::execution=debug fabric serve
# Everything (very verbose — development only)
RUST_LOG=debug fabric serve

The filter syntax follows tracing-subscriber EnvFilter rules:

  • info — global info level
  • fabric_engine=debug — debug for a specific crate
  • fabric_api::routes::sinks=trace — trace for a specific module
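
A minimal subscriber setup that honors these rules looks roughly like this; Fabric's actual initialization code may differ:

```rust
use tracing_subscriber::EnvFilter;

fn init_tracing() {
    tracing_subscriber::fmt()
        // Read RUST_LOG; fall back to plain `info` when it is unset (assumption).
        .with_env_filter(
            EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")),
        )
        .init();
}
```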

When a client (e.g., Socialsite) submits a run via webhook, the request flows through:

POST /v1/sinks/trigger/:org_id/*path
→ SinkRegistry.trigger_http() — path lookup in memory index
→ HttpSink.handle_trigger() — verification + payload merge
→ EngineSinkExecutor.execute() — bridge to engine
→ FabricWorkflowClient.submit() — DB insert + subprocess spawn
→ Python DurableEngine.run() — actual workflow execution

Each stage logs at info level on success and warn/error on failure.
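
A client-side trigger against this route might look like the sketch below; only the route shape, port, and fab_xxx token format come from this page, while the payload shape is an assumption:

```rust
// Hypothetical webhook trigger using reqwest (needs the `blocking` and `json`
// features); posts a JSON payload to the documented trigger route.
fn trigger_run(org_id: &str, sink_path: &str) -> Result<(), reqwest::Error> {
    let url = format!("http://localhost:3001/v1/sinks/trigger/{org_id}{sink_path}");
    let response = reqwest::blocking::Client::new()
        .post(url)
        .bearer_auth("fab_xxx")
        .json(&serde_json::json!({ "example": "payload" }))
        .send()?;
    println!("trigger status: {}", response.status());
    Ok(())
}
```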

Symptom: No “Incoming HTTP sink trigger” log.

Check:

  • Is the server actually running on the expected port? Look for the startup line: Fabric HTTP on <addr>
  • Did the port fall back? Look for PORT ... IS IN USE — fabric fell back to port ...
  • Is the client hitting the correct URL (/v1/sinks/trigger/{org_id}/{path})?

Symptom: No sink found — path not in index warning with registered_count=0.

Cause: Sink registrations weren’t loaded from the database on startup.

Check:

  • Look for: Loaded sink registrations from database at startup (with count)
  • If missing or n=0: no registrations exist in the event_sink_registrations table for this org
  • Re-register the sink via POST /v1/organizations/:org_id/sinks (a hypothetical registration call is sketched below)
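
A registration call might look like the following sketch; the request body is a guess (only the route itself appears on this page), so consult the API reference for the real schema:

```rust
// Hypothetical sink re-registration; the `path` field in the body is assumed.
fn register_sink(org_id: &str) -> Result<(), reqwest::Error> {
    let url = format!("http://localhost:3001/v1/organizations/{org_id}/sinks");
    reqwest::blocking::Client::new()
        .post(url)
        .bearer_auth("fab_xxx")
        .json(&serde_json::json!({ "path": "/hooks/socialsite" }))
        .send()?
        .error_for_status()?;
    Ok(())
}
```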

Symptom: No sink found — path not in index with registered_count > 0 and listed paths don’t match.

Cause: The path in the webhook URL doesn’t match the registered sink path exactly.

Check: The log prints all registered paths — compare against the incoming path. Common issues (a client-side normalization helper is sketched after the list):

  • Trailing slash mismatch (/hooks/socialsite vs /hooks/socialsite/)
  • Case sensitivity (/hooks/Socialsite vs /hooks/socialsite)
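
If your client builds sink paths dynamically, normalizing them before both registration and triggering rules out both mismatches. This helper is a hypothetical client-side convention, not something Fabric applies itself:

```rust
// Hypothetical canonical form: lowercase, no trailing slash. Use the same
// form when registering and when triggering so byte-for-byte matching in the
// sink index succeeds.
fn canonical_sink_path(path: &str) -> String {
    path.trim_end_matches('/').to_ascii_lowercase()
}

fn main() {
    assert_eq!(canonical_sink_path("/hooks/Socialsite/"), "/hooks/socialsite");
}
```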

Subprocess never exits (run stuck in “running”)

Symptom: “Subprocess spawned” log appears, heartbeat ticks continue, but no “Subprocess exited” log.

Cause: The Python process is blocked — usually on a network call (API timeout, unreachable local model).

Diagnose:

# Find the stuck Python process
ps aux | grep DurableEngine
# Check what it's doing
lsof -p <PID> # open files/sockets
# Check the run's updated_at in the database
psql -c "SELECT id, status, updated_at, error FROM fabric_workflow_runs WHERE status = 'running' ORDER BY updated_at DESC LIMIT 5;"

Fix: Kill the stuck process — the heartbeat will stop, and after 120s the stale recovery sweep will mark it failed.
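
The sweep described here amounts to a query along these lines. This sketch uses sqlx; the exact mechanism and error text are assumptions, while the table and columns come from the psql examples on this page:

```rust
// Sketch of a 120s stale-recovery sweep: any `running` row whose heartbeat
// stopped refreshing `updated_at` is marked failed.
async fn sweep_stale_runs(pool: &sqlx::PgPool) -> Result<u64, sqlx::Error> {
    let result = sqlx::query(
        "UPDATE fabric_workflow_runs \
         SET status = 'failed', error = 'stale: heartbeat lost' \
         WHERE status = 'running' \
           AND updated_at < now() - interval '120 seconds'",
    )
    .execute(pool)
    .await?;
    Ok(result.rows_affected())
}
```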

Symptom: “Subprocess exited” with exit_ok=false shortly after spawn.

Check: The error is captured from stderr and stored in the run record:

psql -c "SELECT id, workflow_name, error FROM fabric_workflow_runs WHERE status = 'failed' ORDER BY created_at DESC LIMIT 5;"

Common causes:

  • Python not found (no .venv/bin/python and no python3 on PATH)
  • Missing Python dependency (ModuleNotFoundError)
  • Workflow not in registry (Workflow not found: <name>)
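
The first cause corresponds to an interpreter lookup roughly like the sketch below; the fallback order is implied by the bullet, and the actual logic is an assumption:

```rust
use std::path::{Path, PathBuf};

// Prefer the project virtualenv's interpreter, then fall back to `python3`
// on PATH, mirroring the "Python not found" bullet above.
fn resolve_python(project_root: &Path) -> Option<PathBuf> {
    let venv_python = project_root.join(".venv/bin/python");
    if venv_python.exists() {
        return Some(venv_python);
    }
    which::which("python3").ok() // uses the `which` crate
}
```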

Symptom: Run immediately fails with “Subprocess connection budget exhausted”.

Cause: Too many workflows are running concurrently. The default budget is max_concurrent_workflows (20) × subprocess_pool_per (2) = 40 active DB connections for subprocesses.

Check:

psql -c "SELECT count(*) FROM fabric_workflow_runs WHERE status = 'running';"

Fix: Wait for running workflows to complete, or increase DATABASE_MAX_CONCURRENT_WORKFLOWS.
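
Conceptually the budget behaves like a semaphore sized by that product; here is a minimal sketch assuming a tokio runtime (the types are illustrative, not Fabric's own):

```rust
use std::sync::Arc;
use tokio::sync::{Semaphore, SemaphorePermit, TryAcquireError};

// 20 × 2 = 40 permits by default, per the arithmetic above.
fn subprocess_budget(max_concurrent_workflows: usize, pool_per: usize) -> Arc<Semaphore> {
    Arc::new(Semaphore::new(max_concurrent_workflows * pool_per))
}

// A spawn attempt fails fast when no permit is free, which is when an error
// like "Subprocess connection budget exhausted" would surface.
fn try_reserve(budget: &Semaphore) -> Result<SemaphorePermit<'_>, TryAcquireError> {
    budget.try_acquire()
}
```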

Fabric exposes Prometheus metrics at the /metrics endpoint:

| Metric | What it tells you |
| ------ | ----------------- |
| fabric_workflow_runs_total | Total runs by status |
| fabric_pool_bg_acquires_total | Background pool usage |
| fabric_pool_bg_exhausted_total | Pool budget exhaustion events |
| fabric_circuit_trips_total | Circuit breaker activations |
| fabric_cfs_throttled_usec | CPU throttling (containers) |
A basic health check is also available:

curl http://localhost:3001/health

When the pool is degraded, Fabric logs a health snapshot every 60 seconds:

PoolService health (60s window) total_bg_acquires=42 failed_bg_acquires=3 bg_budget_exhausted=1 subprocess_budget_exhausted=0 circuit_trips=0 pressure=Elevated

Pressure levels: Normal → Elevated → Critical. At Critical, background tasks (maintenance, cron, asset reaping) are paused to preserve connections for HTTP handlers and active workflows.
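
The ladder can be pictured as an ordered enum with a simple gate; the type is hypothetical, but the gating matches the behavior described above:

```rust
// Ordering matters: Normal < Elevated < Critical.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum PoolPressure {
    Normal,
    Elevated,
    Critical,
}

// Background work (maintenance, cron, asset reaping) runs only below Critical,
// leaving connections for HTTP handlers and active workflows.
fn background_tasks_allowed(pressure: PoolPressure) -> bool {
    pressure < PoolPressure::Critical
}
```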

For real-time visibility into run progress without watching server logs, use the SSE event stream:

# Watch all events for an org
curl -N -H "Authorization: Bearer fab_xxx" \
  "http://localhost:3001/v1/events/stream?organization_id=<org_id>"
# Watch a specific run
curl -N -H "Authorization: Bearer fab_xxx" \
  "http://localhost:3001/v1/events/stream?run_id=<run_id>"

Events include: workflow.run.started, workflow.run.completed, workflow.run.failed, workflow.node.started, workflow.node.completed.
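
To consume the stream programmatically rather than through curl, a minimal sketch follows; it assumes reqwest with the stream feature and prints raw SSE frames, where a production client would parse events properly and handle reconnects:

```rust
use futures_util::StreamExt;

// Hypothetical SSE consumer: streams the response body and prints each chunk
// as it arrives.
async fn watch_run(run_id: &str, token: &str) -> Result<(), reqwest::Error> {
    let url = format!("http://localhost:3001/v1/events/stream?run_id={run_id}");
    let response = reqwest::Client::new()
        .get(url)
        .bearer_auth(token)
        .send()
        .await?;
    let mut body = response.bytes_stream();
    while let Some(chunk) = body.next().await {
        print!("{}", String::from_utf8_lossy(&chunk?));
    }
    Ok(())
}
```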