Workflow Run Lifecycle
Every workflow run passes through a well-defined set of states, emitting domain events at each transition. This guide walks through the full lifecycle from submission to completion, explains when each event fires, and helps you choose the right observability mechanism.
For how to define workflows, see Workflow Authoring. For single-node jobs, see Submitting Jobs.
Overview
- Client calls `POST /v1/workflows/run` with `input` and an optional `state` object
- Fabric creates a tracking record and emits `workflow.run.created`
- The run enters the priority queue (`workflow.run.queued`)
- A worker claims the run and spawns N subprocesses — one per fan-out slot (`workflow.run.started`). Default N = 1; set `input.variants` (1–10) for parallel copies of the same workflow, or `input.bundle` for N different workflows in one submission.
- Nodes execute in DAG order, each emitting `workflow.node.started` / `workflow.node.completed` events
- When all subprocesses finish (or one fails under fail-fast), the run reaches a terminal status, the original `state` is echoed back, and per-variant outputs are aggregated into a uniform `outputs: [...]` array on the run row
All events are delivered through both SSE/WebSocket and webhooks in the same DomainEvent format.
Run State Machine
A workflow run progresses through these statuses:
| Status | Description |
|---|---|
| `pending` | Run record created, not yet executing |
| `running` | Executor is actively processing nodes |
| `completed` | All nodes completed successfully |
| `failed` | A node failed (or the run was marked failed during recovery) |
| `cancelled` | Cancelled by user, API call, or fail-fast |
| `paused` | Execution paused (resumable) |
| `waiting` | Blocked on external input (signal/approval) |
Status Transitions
| From | To | Trigger |
|---|---|---|
| `pending` | `running` | Worker spawns the subprocess |
| `running` | `completed` | All DAG paths resolved successfully |
| `running` | `failed` | A required path failed |
| `running` | `cancelled` | `POST .../cancel`, or fail-fast on node failure |
| `running` | `paused` | `POST .../pause` |
| `running` | `waiting` | Node requests external input |
| `paused` | `running` | `POST .../resume` |
| `waiting` | `running` | Signal received via `POST .../signal` or `POST .../approve` |
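The transition table can be encoded as a small client-side lookup, for example to validate optimistic UI updates before the server confirms them. This sketch is illustrative, not part of the SDK, and covers only the transitions documented above:

```typescript
type RunStatus =
  | "pending" | "running" | "completed" | "failed"
  | "cancelled" | "paused" | "waiting";

// One entry per row of the transition table; terminal statuses
// have no outgoing transitions.
const TRANSITIONS: Record<RunStatus, RunStatus[]> = {
  pending: ["running"],
  running: ["completed", "failed", "cancelled", "paused", "waiting"],
  paused: ["running"],
  waiting: ["running"],
  completed: [],
  failed: [],
  cancelled: [],
};

function canTransition(from: RunStatus, to: RunStatus): boolean {
  return TRANSITIONS[from].includes(to);
}

const isTerminal = (s: RunStatus): boolean => TRANSITIONS[s].length === 0;
```

A lookup like this also makes the terminal set explicit, which matters later when deciding when to stop polling or close a stream.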
Node State Machine
Each node (task) within a run has its own lifecycle:
| Status | Event | Description |
|---|---|---|
| `pending` | — | Node created, waiting for dependencies |
| `active` | `workflow.node.started` | Node is executing |
| — | `workflow.node.progress` | Intra-node progress (0+ times, via SDK `log.emit()`) |
| `completed` | `workflow.node.completed` | Node finished successfully |
| `failed` | `workflow.node.failed` | Node failed |
| `skipped` | `workflow.node.skipped` | Upstream dependency failed; node never ran |
Nodes with retry configuration will cycle back from `failed` to `active` until retries are exhausted. If a node’s `error` column is set, the event bridge always emits `workflow.node.failed` regardless of the textual status.
Event Timeline
Happy Path
For a linear workflow A → B → C, the event sequence is:
1. `workflow.run.created` — on `POST /v1/workflows/run`
2. `workflow.run.queued` — enters priority queue
3. `workflow.run.promoted` — worker picks it up
4. `workflow.run.started` — Python subprocess spawns
5. `workflow.node.started` (A)
6. `workflow.node.completed` (A)
7. `workflow.node.started` (B)
8. `workflow.node.progress` (B) — optional, 0+ times
9. `workflow.node.completed` (B)
10. `workflow.node.started` (C)
11. `workflow.node.completed` (C)
12. `workflow.run.completed` — includes `payload.state`
Failure (Fail-Fast)
By default, any node failure cancels the entire run. For a branching DAG where A → B and A → C, if B fails:
1. `workflow.node.started` (B)
2. `workflow.node.failed` (B)
3. `workflow.run.cancelled` — fail-fast triggers, remaining nodes are cancelled
4. `workflow.node.skipped` (any downstream of B)
The run reaches the `cancelled` status with a reason indicating which node failed.
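Which nodes end up skipped is a pure function of the DAG shape: everything downstream of the failed node. A minimal sketch (the `downstreamOf` helper is illustrative, not part of the SDK):

```typescript
// Given a DAG as an adjacency list, collect every node reachable
// from the failed node's children — these are the skipped nodes.
function downstreamOf(dag: Record<string, string[]>, failed: string): Set<string> {
  const skipped = new Set<string>();
  const stack = [...(dag[failed] ?? [])];
  while (stack.length > 0) {
    const node = stack.pop()!;
    if (skipped.has(node)) continue;
    skipped.add(node);
    stack.push(...(dag[node] ?? []));
  }
  return skipped;
}

// A → B, A → C, B → D: if B fails, only D is skipped.
// C is cancelled by fail-fast, not skipped, because it does not depend on B.
const dag: Record<string, string[]> = { A: ["B", "C"], B: ["D"], C: [], D: [] };
```

Note the distinction the event stream makes: nodes that depend on the failure are `skipped`, while unrelated in-flight branches are swept up by the run-level cancellation.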
To allow independent branches to continue after a failure, submit the run with `fail_fast: false`:
```typescript
const run = await fabric.workflows.runs.submit("research/deep-research", {
  input: { topic: "AI trends" },
  fail_fast: false, // allow partial completion
});
```

```python
run = fabric.submit_workflow(
    "research/deep-research",
    input={"topic": "AI trends"},
    fail_fast=False,  # allow partial completion
)
```

```shell
curl -X POST 'https://gofabric.dev/v1/workflows/run?name=research/deep-research' \
  -H "Authorization: Bearer fab_xxx" \
  -H "Content-Type: application/json" \
  -d '{ "input": { "topic": "AI trends" }, "fail_fast": false }'
```

With `fail_fast: false`, node C continues executing even if B fails. The run reaches `failed` only after all reachable paths resolve.
The state Parameter
The `state` field lets you attach opaque context to a run that’s echoed back on completion — no follow-up API call needed.
Lifecycle:
- Submit — include `state` in the request body (any JSON)
- Stored — persisted on the run record in the database
- Readable — returned in `GET /v1/workflow-runs/{id}` responses
- Echoed — included in `payload.state` on terminal events (`completed`, `failed`, `cancelled`)
- Streamed — visible in SSE and webhook deliveries
```typescript
// Submit with state
const run = await fabric.workflows.runs.submit("content/generate", {
  input: { topic: "quarterly trends" },
  state: { workspace_id: "ws-123", triggered_by: "user-456" },
});
```
```typescript
// Receive state in webhook
app.post("/webhook", async (req, res) => {
  const event = await constructWebhookEvent(req.body, req.headers["x-fabric-signature"]!, secret);
  if (event.kind === "workflow.run.completed") {
    const { workspace_id } = event.payload.state;
    // Route output to the correct workspace
  }
  res.sendStatus(200);
});
```

```shell
curl -X POST 'https://gofabric.dev/v1/workflows/run?name=content/generate' \
  -H "Authorization: Bearer fab_xxx" \
  -H "Content-Type: application/json" \
  -d '{ "input": { "topic": "quarterly trends" }, "state": { "workspace_id": "ws-123", "triggered_by": "user-456" } }'
```

Variants and Bundles — Parallel Fan-out
A run is N parallel subprocesses where N defaults to 1. Two ways to fan out:
| Knob | Shape | Use when |
|---|---|---|
| `input.variants: 3` | N copies of the same workflow with the same input | “Give me 3 different videos for the same brief” |
| `input.bundle: [...]` | N different workflows, each with its own input | “Give me 1 video + 1 carousel + 1 thread for the same brief” |
The two are mutually exclusive — set one or the other, not both.
The response shape is uniform regardless of N:
```json
{
  "run_id": "...",
  "status": "completed",
  "variants": 3,
  "outputs": [
    { "variant_index": 0, "workflow_name": "video/ai_shorts", "kind": "video", "output": {...}, "artifacts": [...] },
    { "variant_index": 1, "workflow_name": "video/ai_shorts", "kind": "video", "output": {...}, "artifacts": [...] },
    { "variant_index": 2, "workflow_name": "video/ai_shorts", "kind": "video", "output": {...}, "artifacts": [...] }
  ]
}
```

`outputs.length === variants`. Single-workflow runs have one entry; consumers iterate the array uniformly with no shape branching.
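Because the shape is uniform, consumer code needs no branching on N. A sketch, with minimal illustrative types standing in for the full response schema:

```typescript
// Trimmed-down shapes for the response above (illustrative only,
// not the complete API types).
interface VariantOutput {
  variant_index: number;
  workflow_name: string;
  kind: string;
  output: unknown;
  artifacts: unknown[];
}

interface RunResult {
  run_id: string;
  status: string;
  variants: number;
  outputs: VariantOutput[];
}

// Works identically whether variants is 1 or N.
function summarize(run: RunResult): string[] {
  return run.outputs.map(
    (o) => `#${o.variant_index} ${o.workflow_name} (${o.kind})`
  );
}
```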
`fail_fast` semantics (see Failure and Recovery) apply across the fan-out — any one variant failing fails the run by default; set `fail_fast: false` to let successful variants ship even when others fail.
See the TypeScript SDK reference and Python SDK reference for the `submitAndGetOutput` / `submit_and_get_output` helpers.
Regenerate — Variations of an Existing Run
Pass `input.regenerate` on submit to mark a run as a regeneration of an earlier run/variant. The server persists `parent_run_id` and `parent_variant_index` as run lineage columns. Workflows that support regeneration read `direction` (`punchier` / `deeper` / `contrarian` / `visual` / `data-first` / `surprise`), `keep` flags, and `extra_instructions` from `input.regenerate` and modulate their prompts; workflows that don’t simply ignore it.
```json
{
  "input": {
    "topic": "AI trends",
    "regenerate": {
      "direction": "punchier",
      "keep": ["platform", "tone_of_voice"],
      "parent_run_id": "...",
      "parent_variant_index": 0
    }
  }
}
```

This works orthogonally with `variants` and `bundle` — a regeneration can itself fan out into N variants.
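A small helper can derive the regeneration payload from a finished run. The field names follow the payload above; the `regenerateInput` helper itself is hypothetical, not an SDK function:

```typescript
// Options a workflow may read from input.regenerate (per this doc).
interface RegenerateOptions {
  direction: "punchier" | "deeper" | "contrarian" | "visual" | "data-first" | "surprise";
  keep?: string[];
  extra_instructions?: string;
}

// Build a new input that carries lineage back to the parent run/variant.
function regenerateInput(
  parentRunId: string,
  parentVariantIndex: number,
  baseInput: Record<string, unknown>,
  opts: RegenerateOptions
): Record<string, unknown> {
  return {
    ...baseInput,
    regenerate: {
      ...opts,
      parent_run_id: parentRunId,
      parent_variant_index: parentVariantIndex,
    },
  };
}
```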
Observing the Lifecycle
Fabric offers three mechanisms for tracking run progress. Choose based on your use case:
| Need | Mechanism | Why |
|---|---|---|
| Real-time UI progress bar | SSE via `useWorkflowRun` | Low latency, auto-closes on terminal event, replays on reconnect |
| Backend pipeline trigger | Webhook | Push-based, no open connection, retries built in |
| Poll from a cron job | `GET /v1/workflow-runs/{id}` | Simplest — no subscription setup |
| UI updates and backend processing | SSE + Webhook | SSE for the browser, webhook for your server |
| Audit log of all events | Webhook (all events) | Persistent delivery with retry guarantee |
Key behavioral differences:
- SSE auto-closes after terminal run events (`completed`, `failed`, `cancelled`). Webhooks fire independently for all matching events.
- SSE replays all past events on connect (catch-up). Webhooks are fire-and-forget with retries.
- SSE is per-connection. Webhooks are per-subscription (org-scoped, persist across connections).
- Both deliver the same `DomainEvent` structure.
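For polling consumers, the only Fabric-specific detail is the terminal status set. A sketch, where `fetchRun` stands in for any wrapper around `GET /v1/workflow-runs/{id}` (it is an assumed function, not part of the SDK):

```typescript
// Terminal run statuses per this guide's state machine.
const TERMINAL = new Set(["completed", "failed", "cancelled"]);

// Poll until the run reaches a terminal status.
async function pollUntilDone(
  fetchRun: (id: string) => Promise<{ status: string }>,
  runId: string,
  intervalMs = 5000
): Promise<{ status: string }> {
  for (;;) {
    const run = await fetchRun(runId);
    if (TERMINAL.has(run.status)) return run;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Note that `paused` and `waiting` are not terminal: a polling loop must keep going through them, since the run can resume.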
Failure and Recovery
Fail-Fast (Default)
When `fail_fast` is true (the default), any `workflow.node.failed` event triggers automatic cancellation of the run. This is the right behavior for most sequential pipelines — if script generation fails, there’s no point running voiceover or rendering.
Partial Completion
Set `fail_fast: false` when you have independent branches that should continue even if one fails. Use this for fan-out patterns, best-effort processing, or workflows where partial output is useful.
Stale Recovery
The event bridge periodically scans for stalled workflows:
- Stale threshold: 120 seconds of inactivity (no task transitions or heartbeats)
- Recovery interval: Every 60 seconds
- Max attempts: 3 recovery attempts before marking the run as failed
If a workflow is legitimately long-running (e.g., video rendering), the subprocess sends heartbeats every 30 seconds to prevent false recovery.
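The staleness rule reduces to a timestamp comparison. This sketch encodes the thresholds listed above (the real scanner runs server-side; the helper is illustrative):

```typescript
const STALE_THRESHOLD_MS = 120_000;   // 120 s of inactivity marks a run stale
const HEARTBEAT_INTERVAL_MS = 30_000; // subprocess heartbeat cadence

// A run is stale once inactivity exceeds the threshold.
function isStale(lastActivityAtMs: number, nowMs: number): boolean {
  return nowMs - lastActivityAtMs > STALE_THRESHOLD_MS;
}
```

The heartbeat interval (30 s) is deliberately well under the stale threshold (120 s), so a healthy long-running subprocess can miss several heartbeats before recovery kicks in.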
Subprocess Budget
Each workflow run requires a subprocess connection slot. If all slots are occupied, the run fails immediately with `"Subprocess connection budget exhausted"`. Monitor this error to scale capacity.
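One mitigation on the client side is to retry submission with exponential backoff when this specific error appears. The helper below is a sketch, not an SDK feature; `submit` is any async function that throws on failure:

```typescript
const BUDGET_ERROR = "Subprocess connection budget exhausted";

// 1 s, 2 s, 4 s, ... between attempts.
const backoffDelay = (attempt: number, baseMs = 1000): number => baseMs * 2 ** attempt;

async function submitWithBackoff<T>(
  submit: () => Promise<T>,
  maxAttempts = 4
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await submit();
    } catch (err) {
      const msg = err instanceof Error ? err.message : String(err);
      // Only the budget error is retryable here; anything else rethrows.
      if (!msg.includes(BUDGET_ERROR) || attempt + 1 >= maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
}
```

Backoff only masks transient contention; if the error is frequent, the fix is more capacity, not more retries.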
Cancellation and Pause/Resume
```typescript
// Cancel a running workflow
await fabric.workflows.runs.cancel(runId, { reason: "No longer needed" });

// Pause execution
await fabric.workflows.runs.pause(runId, { reason: "Waiting for approval" });

// Resume from paused state
await fabric.workflows.runs.resume(runId);

// Send a signal (resumes from waiting state)
await fabric.workflows.runs.signal(runId, {
  signal_name: "approval",
  payload: { approved: true },
});
```

```python
# Cancel
fabric.cancel_workflow_run(run_id, reason="No longer needed")

# Pause
fabric.pause_workflow_run(run_id, reason="Waiting for approval")

# Resume
fabric.resume_workflow_run(run_id)

# Signal
fabric.signal_workflow_run(run_id, signal_name="approval", payload={"approved": True})
```

```shell
# Cancel
curl -X POST "https://gofabric.dev/v1/workflow-runs/$RUN_ID/cancel" \
  -H "Authorization: Bearer fab_xxx" \
  -H "Content-Type: application/json" \
  -d '{"reason": "No longer needed"}'

# Pause
curl -X POST "https://gofabric.dev/v1/workflow-runs/$RUN_ID/pause" \
  -H "Authorization: Bearer fab_xxx" \
  -H "Content-Type: application/json" \
  -d '{"reason": "Waiting for approval"}'

# Resume
curl -X POST "https://gofabric.dev/v1/workflow-runs/$RUN_ID/resume" \
  -H "Authorization: Bearer fab_xxx"

# Signal
curl -X POST "https://gofabric.dev/v1/workflow-runs/$RUN_ID/signal" \
  -H "Authorization: Bearer fab_xxx" \
  -H "Content-Type: application/json" \
  -d '{"signal_name": "approval", "payload": {"approved": true}}'
```

Each operation emits a corresponding event (`workflow.run.cancelled`, `workflow.run.paused`, etc.) delivered via both SSE and webhooks. Cancellation echoes `payload.state` just like completion.
Internal Shim Nodes
The SDK inserts two internal nodes on every run:
- `_fabric_capture_input` — captures raw input at the start
- `_fabric_finalize_output` — collects artifacts and final output at the end
These are hidden from SSE by default to keep event streams clean. To include them (useful for debugging), add `?include_internal=true` to the SSE endpoint:
```shell
curl -N "https://gofabric.dev/v1/workflow-runs/$RUN_ID/events?include_internal=true"
```

Quick Reference
| Phase | Run Status | Events | SSE | Webhook |
|---|---|---|---|---|
| Submit | `pending` | `run.created` | Yes | Yes |
| Queue | `pending` | `run.queued`, `run.promoted` | Yes | Yes |
| Execute | `running` | `run.started`, `node.*` | Yes | Yes |
| Complete | `completed` | `run.completed` (+ `state`) | Yes, then closes | Yes |
| Fail | `failed` | `run.failed` (+ `state`) | Yes, then closes | Yes |
| Cancel | `cancelled` | `run.cancelled` (+ `state`) | Yes, then closes | Yes |
| Pause | `paused` | `run.paused` | Yes | Yes |
| Wait | `waiting` | `run.waiting` | Yes | Yes |