Observability & Debugging

Fabric uses the Rust tracing ecosystem for structured logging. Every significant control-plane action — from inbound webhook triggers through the subprocess lifecycle — emits structured log events with correlation fields (run_id, org_id, workflow, sink_id).
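
For illustration, the shape of such an event in the tracing API looks like the sketch below; the function and message text are hypothetical, and only the field names come from this page:

```rust
use tracing::info;

// Hypothetical sketch: correlation fields are attached as structured fields
// on the event, not interpolated into the message string.
fn log_trigger_accepted(run_id: &str, org_id: &str, workflow: &str, sink_id: &str) {
    info!(run_id, org_id, workflow, sink_id, "Trigger accepted");
}
```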

| Level | What you see |
| ----- | ------------ |
| error | Workflow failures, subprocess crashes, bootstrap failures |
| warn  | Verification failures, missing sinks, stale recovery triggers, pool pressure |
| info  | Trigger accepted, subprocess spawned/exited, workflow completed, startup banners |
| debug | Sink path lookups, heartbeat ticks, internal state transitions |
| trace | Detailed SQL queries, HTTP request/response bodies (very verbose) |

Set the RUST_LOG environment variable before starting the server:

# Standard operation (recommended default)
RUST_LOG=info fabric serve
# Debug the trigger → execution pipeline
RUST_LOG=info,fabric_api::routes::sinks=debug,fabric_events::sink=debug,fabric_engine=debug fabric serve
# Debug only the subprocess lifecycle
RUST_LOG=info,fabric_engine::client::execution=debug fabric serve
# Everything (very verbose — development only)
RUST_LOG=debug fabric serve

The filter syntax follows tracing-subscriber EnvFilter rules:

  • info — global info level
  • fabric_engine=debug — debug for a specific crate
  • fabric_api::routes::sinks=trace — trace for a specific module
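
A minimal subscriber setup that honors these rules looks roughly like this; Fabric's actual initialization code may differ:

```rust
use tracing_subscriber::EnvFilter;

fn init_tracing() {
    tracing_subscriber::fmt()
        // Read RUST_LOG; fall back to plain `info` when it is unset (assumption).
        .with_env_filter(
            EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")),
        )
        .init();
}
```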

When a client (e.g., Socialsite) submits a run via webhook, the request flows through:

POST /v1/sinks/trigger/:org_id/*path
→ SinkRegistry.trigger_http() — path lookup in memory index
→ HttpSink.handle_trigger() — verification + payload merge
→ EngineSinkExecutor.execute() — bridge to engine
→ FabricWorkflowClient.submit() — DB insert + subprocess spawn
→ Python DurableEngine.run() — actual workflow execution

Each stage logs at info level on success and warn/error on failure.
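
A client-side trigger against this route might look like the sketch below; only the route shape, port, and fab_xxx token format come from this page, while the payload shape is an assumption:

```rust
// Hypothetical webhook trigger using reqwest (needs the `blocking` and `json`
// features); posts a JSON payload to the documented trigger route.
fn trigger_run(org_id: &str, sink_path: &str) -> Result<(), reqwest::Error> {
    let url = format!("http://localhost:3001/v1/sinks/trigger/{org_id}{sink_path}");
    let response = reqwest::blocking::Client::new()
        .post(url)
        .bearer_auth("fab_xxx")
        .json(&serde_json::json!({ "example": "payload" }))
        .send()?;
    println!("trigger status: {}", response.status());
    Ok(())
}
```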

Symptom: No “Incoming HTTP sink trigger” log.

Check:

  • Is the server actually running on the expected port? Look for the startup line: Fabric HTTP on <addr>
  • Did the port fall back? Look for PORT ... IS IN USE — fabric fell back to port ...
  • Is the client hitting the correct URL (/v1/sinks/trigger/{org_id}/{path})?

Symptom: No sink found — path not in index warning with registered_count=0.

Cause: Sink registrations weren’t loaded from the database on startup.

Check:

  • Look for: Loaded sink registrations from database at startup (with count)
  • If missing or n=0: no registrations exist in the event_sink_registrations table for this org
  • Re-register the sink via POST /v1/organizations/:org_id/sinks (a hypothetical registration call is sketched below)
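
A registration call might look like the following sketch; the request body is a guess (only the route itself appears on this page), so consult the API reference for the real schema:

```rust
// Hypothetical sink re-registration; the `path` field in the body is assumed.
fn register_sink(org_id: &str) -> Result<(), reqwest::Error> {
    let url = format!("http://localhost:3001/v1/organizations/{org_id}/sinks");
    reqwest::blocking::Client::new()
        .post(url)
        .bearer_auth("fab_xxx")
        .json(&serde_json::json!({ "path": "/hooks/socialsite" }))
        .send()?
        .error_for_status()?;
    Ok(())
}
```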

Symptom: No sink found — path not in index with registered_count > 0 and listed paths don’t match.

Cause: The path in the webhook URL doesn’t match the registered sink path exactly.

Check: The log prints all registered paths — compare against the incoming path. Common issues (a client-side normalization helper is sketched after the list):

  • Trailing slash mismatch (/hooks/socialsite vs /hooks/socialsite/)
  • Case sensitivity (/hooks/Socialsite vs /hooks/socialsite)
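
If your client builds sink paths dynamically, normalizing them before both registration and triggering rules out both mismatches. This helper is a hypothetical client-side convention, not something Fabric applies itself:

```rust
// Hypothetical canonical form: lowercase, no trailing slash. Use the same
// form when registering and when triggering so byte-for-byte matching in the
// sink index succeeds.
fn canonical_sink_path(path: &str) -> String {
    path.trim_end_matches('/').to_ascii_lowercase()
}

fn main() {
    assert_eq!(canonical_sink_path("/hooks/Socialsite/"), "/hooks/socialsite");
}
```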

Subprocess never exits (run stuck in “running”)

Symptom: “Subprocess spawned” log appears, heartbeat ticks continue, but no “Subprocess exited” log.

Cause: The Python process is blocked — usually on a network call (API timeout, unreachable local model).

Diagnose:

# Find the stuck Python process
ps aux | grep DurableEngine
# Check what it's doing
lsof -p <PID> # open files/sockets
# Check the run's updated_at in the database
psql -c "SELECT id, status, updated_at, error FROM fabric_workflow_runs WHERE status = 'running' ORDER BY updated_at DESC LIMIT 5;"

Fix: Kill the stuck process — the heartbeat will stop, and after 120s the stale recovery sweep will mark it failed.
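
The sweep described here amounts to a query along these lines. This sketch uses sqlx; the exact mechanism and error text are assumptions, while the table and columns come from the psql examples on this page:

```rust
// Sketch of a 120s stale-recovery sweep: any `running` row whose heartbeat
// stopped refreshing `updated_at` is marked failed.
async fn sweep_stale_runs(pool: &sqlx::PgPool) -> Result<u64, sqlx::Error> {
    let result = sqlx::query(
        "UPDATE fabric_workflow_runs \
         SET status = 'failed', error = 'stale: heartbeat lost' \
         WHERE status = 'running' \
           AND updated_at < now() - interval '120 seconds'",
    )
    .execute(pool)
    .await?;
    Ok(result.rows_affected())
}
```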

Symptom: “Subprocess exited” with exit_ok=false shortly after spawn.

Check: The error is captured from stderr and stored in the run record:

psql -c "SELECT id, workflow_name, error FROM fabric_workflow_runs WHERE status = 'failed' ORDER BY created_at DESC LIMIT 5;"

Common causes:

  • Python not found (no .venv/bin/python and no python3 on PATH)
  • Missing Python dependency (ModuleNotFoundError)
  • Workflow not in registry (Workflow not found: <name>)
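
The first cause corresponds to an interpreter lookup roughly like the sketch below; the fallback order is implied by the bullet, and the actual logic is an assumption:

```rust
use std::path::{Path, PathBuf};

// Prefer the project virtualenv's interpreter, then fall back to `python3`
// on PATH, mirroring the "Python not found" bullet above.
fn resolve_python(project_root: &Path) -> Option<PathBuf> {
    let venv_python = project_root.join(".venv/bin/python");
    if venv_python.exists() {
        return Some(venv_python);
    }
    which::which("python3").ok() // uses the `which` crate
}
```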

Symptom: Run immediately fails with “Subprocess connection budget exhausted”.

Cause: Too many workflows are running concurrently. The default budget is max_concurrent_workflows (20) × subprocess_pool_per (2) = 40 active DB connections for subprocesses.

Check:

psql -c "SELECT count(*) FROM fabric_workflow_runs WHERE status = 'running';"

Fix: Wait for running workflows to complete, or increase DATABASE_MAX_CONCURRENT_WORKFLOWS.
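
Conceptually the budget behaves like a semaphore sized by that product; here is a minimal sketch assuming a tokio runtime (the types are illustrative, not Fabric's own):

```rust
use std::sync::Arc;
use tokio::sync::{Semaphore, SemaphorePermit, TryAcquireError};

// 20 × 2 = 40 permits by default, per the arithmetic above.
fn subprocess_budget(max_concurrent_workflows: usize, pool_per: usize) -> Arc<Semaphore> {
    Arc::new(Semaphore::new(max_concurrent_workflows * pool_per))
}

// A spawn attempt fails fast when no permit is free, which is when an error
// like "Subprocess connection budget exhausted" would surface.
fn try_reserve(budget: &Semaphore) -> Result<SemaphorePermit<'_>, TryAcquireError> {
    budget.try_acquire()
}
```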

Fabric exposes Prometheus metrics at the /metrics endpoint:

| Metric | What it tells you |
| ------ | ----------------- |
| fabric_workflow_runs_total | Total runs by status |
| fabric_pool_bg_acquires_total | Background pool usage |
| fabric_pool_bg_exhausted_total | Pool budget exhaustion events |
| fabric_circuit_trips_total | Circuit breaker activations |
| fabric_cfs_throttled_usec | CPU throttling (containers) |
A basic health check is also available:

curl http://localhost:3001/health

When the pool is degraded, Fabric logs a health snapshot every 60 seconds:

PoolService health (60s window) total_bg_acquires=42 failed_bg_acquires=3 bg_budget_exhausted=1 subprocess_budget_exhausted=0 circuit_trips=0 pressure=Elevated

Pressure levels: Normal → Elevated → Critical. At Critical, background tasks (maintenance, cron, asset reaping) are paused to preserve connections for HTTP handlers and active workflows.
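
The ladder can be pictured as an ordered enum with a simple gate; the type is hypothetical, but the gating matches the behavior described above:

```rust
// Ordering matters: Normal < Elevated < Critical.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum PoolPressure {
    Normal,
    Elevated,
    Critical,
}

// Background work (maintenance, cron, asset reaping) runs only below Critical,
// leaving connections for HTTP handlers and active workflows.
fn background_tasks_allowed(pressure: PoolPressure) -> bool {
    pressure < PoolPressure::Critical
}
```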

For real-time visibility into run progress without watching server logs, use the SSE event stream:

# Watch all events for an org
curl -N -H "Authorization: Bearer fab_xxx" \
  "http://localhost:3001/v1/events/stream?organization_id=<org_id>"
# Watch a specific run
curl -N -H "Authorization: Bearer fab_xxx" \
  "http://localhost:3001/v1/events/stream?run_id=<run_id>"

Events include: workflow.run.started, workflow.run.completed, workflow.run.failed, workflow.node.started, workflow.node.completed.
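
To consume the stream programmatically rather than through curl, a minimal sketch follows; it assumes reqwest with the stream feature and prints raw SSE frames, where a production client would parse events properly and handle reconnects:

```rust
use futures_util::StreamExt;

// Hypothetical SSE consumer: streams the response body and prints each chunk
// as it arrives.
async fn watch_run(run_id: &str, token: &str) -> Result<(), reqwest::Error> {
    let url = format!("http://localhost:3001/v1/events/stream?run_id={run_id}");
    let response = reqwest::Client::new()
        .get(url)
        .bearer_auth(token)
        .send()
        .await?;
    let mut body = response.bytes_stream();
    while let Some(chunk) = body.next().await {
        print!("{}", String::from_utf8_lossy(&chunk?));
    }
    Ok(())
}
```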