
Workflow Run Lifecycle

Every workflow run passes through a well-defined set of states, emitting domain events at each transition. This guide walks through the full lifecycle from submission to completion, explains when each event fires, and helps you choose the right observability mechanism.

For how to define workflows, see Workflow Authoring. For single-node jobs, see Submitting Jobs.

  1. Client calls POST /v1/workflows/run with input and an optional state object
  2. Fabric creates a tracking record and emits workflow.run.created
  3. The run enters the priority queue (workflow.run.queued)
  4. A worker claims the run and spawns N subprocesses — one per fan-out slot (workflow.run.started). Default N = 1; set input.variants (1–10) for parallel copies of the same workflow, or input.bundle for N different workflows in one submission.
  5. Nodes execute in DAG order, each emitting workflow.node.started / workflow.node.completed events
  6. When all subprocesses finish (or one fails under fail-fast), the run reaches a terminal status, the original state is echoed back, and per-variant outputs are aggregated into a uniform outputs: [...] array on the run row

All events are delivered through both SSE/WebSocket and webhooks in the same DomainEvent format.

A workflow run progresses through these statuses:

| Status | Description |
| --- | --- |
| `pending` | Run record created, not yet executing |
| `running` | Executor is actively processing nodes |
| `completed` | All nodes completed successfully |
| `failed` | A node failed (or the run was marked failed during recovery) |
| `cancelled` | Cancelled by user, API call, or fail-fast |
| `paused` | Execution paused (resumable) |
| `waiting` | Blocked on external input (signal/approval) |

Valid transitions between statuses:

| From | To | Trigger |
| --- | --- | --- |
| `pending` | `running` | Worker spawns the subprocess |
| `running` | `completed` | All DAG paths resolved successfully |
| `running` | `failed` | A required path failed |
| `running` | `cancelled` | POST `.../cancel`, or fail-fast on node failure |
| `running` | `paused` | POST `.../pause` |
| `running` | `waiting` | Node requests external input |
| `paused` | `running` | POST `.../resume` |
| `waiting` | `running` | Signal received via POST `.../signal` or POST `.../approve` |
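The transition table can be mirrored client-side, for example to validate optimistic UI updates before the server confirms them. A minimal sketch (the server remains the source of truth for transitions):

```typescript
// Allowed run-status transitions, mirroring the transition table above.
const TRANSITIONS: Record<string, string[]> = {
  pending: ["running"],
  running: ["completed", "failed", "cancelled", "paused", "waiting"],
  paused: ["running"],
  waiting: ["running"],
  // terminal statuses (completed/failed/cancelled) have no outgoing edges
};

function canTransition(from: string, to: string): boolean {
  return (TRANSITIONS[from] ?? []).includes(to);
}
```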

Each node (task) within a run has its own lifecycle:

| Status | Event | Description |
| --- | --- | --- |
| `pending` | (none) | Node created, waiting for dependencies |
| `active` | `workflow.node.started` | Node is executing |
| `active` | `workflow.node.progress` | Intra-node progress (0+ times, via SDK `log.emit()`) |
| `completed` | `workflow.node.completed` | Node finished successfully |
| `failed` | `workflow.node.failed` | Node failed |
| `skipped` | `workflow.node.skipped` | Upstream dependency failed; node never ran |

Nodes with retry configuration will cycle back from failed to active until retries are exhausted. If a node’s error column is set, the event bridge always emits workflow.node.failed regardless of the textual status.

For a linear workflow A → B → C, the event sequence is:

  1. workflow.run.created — on POST /v1/workflows/run
  2. workflow.run.queued — enters priority queue
  3. workflow.run.promoted — worker picks it up
  4. workflow.run.started — Python subprocess spawns
  5. workflow.node.started (A)
  6. workflow.node.completed (A)
  7. workflow.node.started (B)
  8. workflow.node.progress (B) — optional, 0+ times
  9. workflow.node.completed (B)
  10. workflow.node.started (C)
  11. workflow.node.completed (C)
  12. workflow.run.completed — includes payload.state
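A UI consuming this stream can derive a coarse progress value by counting terminal node events. A sketch, assuming you know the total node count up front (3 for the A → B → C example):

```typescript
// Derive a completion fraction from the node-level event kinds seen so far.
// Counts both completed and skipped nodes as "resolved".
function progressFraction(eventKinds: string[], totalNodes: number): number {
  const resolved = eventKinds.filter(
    (k) => k === "workflow.node.completed" || k === "workflow.node.skipped"
  ).length;
  return Math.min(resolved / totalNodes, 1);
}
```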

By default, any node failure cancels the entire run. For a branching DAG where A → B and A → C, if B fails:

  1. workflow.node.started (B)
  2. workflow.node.failed (B)
  3. workflow.run.cancelled — fail-fast triggers, remaining nodes are cancelled
  4. workflow.node.skipped (any downstream of B)

The run reaches the cancelled status with a reason indicating which node failed.

To allow independent branches to continue after a failure, submit the run with fail_fast: false:

```typescript
const run = await fabric.workflows.runs.submit("research/deep-research", {
  input: { topic: "AI trends" },
  fail_fast: false, // allow partial completion
});
```

With fail_fast: false, node C continues executing even if B fails. The run reaches failed only after all reachable paths resolve.

The state field lets you attach opaque context to a run that’s echoed back on completion — no follow-up API call needed.

Lifecycle:

  1. Submit — include state in the request body (any JSON)
  2. Stored — persisted on the run record in the database
  3. Readable — returned in GET /v1/workflow-runs/{id} responses
  4. Echoed — included in payload.state on terminal events (completed, failed, cancelled)
  5. Streamed — visible in SSE and webhook deliveries

```typescript
// Submit with state
const run = await fabric.workflows.runs.submit("content/generate", {
  input: { topic: "quarterly trends" },
  state: { workspace_id: "ws-123", triggered_by: "user-456" },
});
```

```typescript
// Receive state in webhook
app.post("/webhook", async (req, res) => {
  const event = await constructWebhookEvent(req.body, req.headers["x-fabric-signature"]!, secret);
  if (event.kind === "workflow.run.completed") {
    const { workspace_id } = event.payload.state;
    // Route output to the correct workspace
  }
  res.sendStatus(200);
});
```

A run is N parallel subprocesses where N defaults to 1. Two ways to fan out:

| Knob | Shape | Use when |
| --- | --- | --- |
| `input.variants: 3` | N copies of the same workflow with the same input | "Give me 3 different videos for the same brief" |
| `input.bundle: [...]` | N different workflows, each with its own input | "Give me 1 video + 1 carousel + 1 thread for the same brief" |

The two are mutually exclusive — set one or the other, not both.
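A client-side guard can catch the invalid combination (and the documented 1–10 variants range) before submission. A sketch; the input shape beyond `variants` and `bundle` is workflow-specific:

```typescript
// Reject submissions that set both fan-out knobs, since the docs state
// they are mutuallyexclusive, and enforce the documented variants range.
function assertFanOutValid(input: { variants?: number; bundle?: unknown[] }): void {
  if (input.variants !== undefined && input.bundle !== undefined) {
    throw new Error("input.variants and input.bundle are mutually exclusive");
  }
  if (input.variants !== undefined && (input.variants < 1 || input.variants > 10)) {
    throw new Error("input.variants must be between 1 and 10");
  }
}
```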

The response shape is uniform regardless of N:

```json
{
  "run_id": "...",
  "status": "completed",
  "variants": 3,
  "outputs": [
    { "variant_index": 0, "workflow_name": "video/ai_shorts", "kind": "video", "output": {...}, "artifacts": [...] },
    { "variant_index": 1, "workflow_name": "video/ai_shorts", "kind": "video", "output": {...}, "artifacts": [...] },
    { "variant_index": 2, "workflow_name": "video/ai_shorts", "kind": "video", "output": {...}, "artifacts": [...] }
  ]
}
```

outputs.length === variants. Single-workflow runs have one entry; consumers iterate the array uniformly with no shape branching.
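For example, a consumer of a mixed bundle run might group the uniform `outputs` array by `kind`. Field names are taken from the response shape above; the grouping helper itself is illustrative:

```typescript
interface VariantOutput {
  variant_index: number;
  workflow_name: string;
  kind: string;
  output: unknown;
  artifacts: unknown[];
}

// Group variant outputs by kind -- useful for bundle runs that mix
// e.g. a video, a carousel, and a thread in one submission.
function groupByKind(outputs: VariantOutput[]): Map<string, VariantOutput[]> {
  const groups = new Map<string, VariantOutput[]>();
  for (const o of outputs) {
    const bucket = groups.get(o.kind) ?? [];
    bucket.push(o);
    groups.set(o.kind, bucket);
  }
  return groups;
}
```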

fail_fast semantics (see Failure and Recovery) apply across the fan-out — any one variant failing fails the run by default; set fail_fast: false to let successful variants ship even when others fail.

See the TypeScript SDK reference and Python SDK reference for submitAndGetOutput / submit_and_get_output helpers.

Regenerate — Variations of an Existing Run


Pass input.regenerate on submit to mark a run as a regeneration of an earlier run/variant. The server persists parent_run_id and parent_variant_index as run lineage columns. Workflows that support regeneration read direction (punchier / deeper / contrarian / visual / data-first / surprise), keep, and extra_instructions from input.regenerate and adjust their prompts accordingly; workflows that don't simply ignore it.

```json
{
  "input": {
    "topic": "AI trends",
    "regenerate": {
      "direction": "punchier",
      "keep": ["platform", "tone_of_voice"],
      "parent_run_id": "...",
      "parent_variant_index": 0
    }
  }
}
```

This works orthogonally with variants and bundle — a regeneration can itself fan out into N variants.
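For example, a helper might build a regeneration input that also fans out. A sketch: the `regenerate` field names mirror the JSON example above, while `topic` and the specific values are illustrative:

```typescript
// Build a regeneration input that also fans out into 3 variants.
// Field names follow the regenerate example above; values are illustrative.
function buildRegenerateInput(parentRunId: string, parentVariantIndex: number) {
  return {
    topic: "AI trends",
    variants: 3, // a regeneration can itself fan out
    regenerate: {
      direction: "punchier",
      keep: ["platform", "tone_of_voice"],
      parent_run_id: parentRunId,
      parent_variant_index: parentVariantIndex,
    },
  };
}
```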

Fabric offers three mechanisms for tracking run progress. Choose based on your use case:

| Need | Mechanism | Why |
| --- | --- | --- |
| Real-time UI progress bar | SSE via `useWorkflowRun` | Low latency, auto-closes on terminal event, replays on reconnect |
| Backend pipeline trigger | Webhook | Push-based, no open connection, retries built in |
| Poll from a cron job | `GET /v1/workflow-runs/{id}` | Simplest — no subscription setup |
| UI updates and backend processing | SSE + Webhook | SSE for the browser, webhook for your server |
| Audit log of all events | Webhook (all events) | Persistent delivery with retry guarantee |

Key behavioral differences:

  • SSE auto-closes after terminal run events (completed, failed, cancelled). Webhooks fire independently for all matching events.
  • SSE replays all past events on connect (catch-up). Webhooks are fire-and-forget with retries.
  • SSE is per-connection. Webhooks are per-subscription (org-scoped, persist across connections).
  • Both deliver the same DomainEvent structure.
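Because SSE auto-closes only on run-level terminal events, consumers often want their own terminal check as well, e.g. to stop polling or flush buffers. The three event kinds come from this page:

```typescript
// Run-level terminal event kinds, as listed in the lifecycle above.
const TERMINAL_KINDS = new Set([
  "workflow.run.completed",
  "workflow.run.failed",
  "workflow.run.cancelled",
]);

// True when an event kind marks the end of a run's event stream.
function isTerminal(kind: string): boolean {
  return TERMINAL_KINDS.has(kind);
}
```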

When fail_fast is true (the default), any workflow.node.failed event triggers automatic cancellation of the run. This is the right behavior for most sequential pipelines — if script generation fails, there’s no point running voiceover or rendering.

Set fail_fast: false when you have independent branches that should continue even if one fails. Use this for fan-out patterns, best-effort processing, or workflows where partial output is useful.

The event bridge periodically scans for stalled workflows:

  • Stale threshold: 120 seconds of inactivity (no task transitions or heartbeats)
  • Recovery interval: Every 60 seconds
  • Max attempts: 3 recovery attempts before marking the run as failed

If a workflow is legitimately long-running (e.g., video rendering), the subprocess sends heartbeats every 30 seconds to prevent false recovery.

Each workflow run requires a subprocess connection slot. If all slots are occupied, the run fails immediately with "Subprocess connection budget exhausted". Monitor this error to scale capacity.
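One mitigation is to retry submissions on this specific error with backoff. A sketch: only the error message string comes from this page; the wrapper itself is a generic pattern, not part of the SDK:

```typescript
// Retry a submission when the subprocess connection budget is exhausted,
// with exponential backoff. `submit` is whatever call your SDK exposes.
async function submitWithBackoff<T>(
  submit: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await submit();
    } catch (err) {
      const budgetError =
        err instanceof Error &&
        err.message.includes("Subprocess connection budget exhausted");
      if (!budgetError || attempt + 1 >= maxAttempts) throw err;
      // wait 1x, 2x, 4x, ... the base delay before the next attempt
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}
```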

```typescript
// Cancel a running workflow
await fabric.workflows.runs.cancel(runId, { reason: "No longer needed" });

// Pause execution
await fabric.workflows.runs.pause(runId, { reason: "Waiting for approval" });

// Resume from paused state
await fabric.workflows.runs.resume(runId);

// Send a signal (resumes from waiting state)
await fabric.workflows.runs.signal(runId, {
  signal_name: "approval",
  payload: { approved: true },
});
```

Each operation emits a corresponding event (workflow.run.cancelled, workflow.run.paused, etc.) delivered via both SSE and webhooks. Cancellation echoes payload.state just like completion.

The SDK inserts two internal nodes on every run:

  • _fabric_capture_input — captures raw input at the start
  • _fabric_finalize_output — collects artifacts and final output at the end

These are hidden from SSE by default to keep event streams clean. To include them (useful for debugging), add ?include_internal=true to the SSE endpoint:

```sh
curl -N "https://gofabric.dev/v1/workflow-runs/$RUN_ID/events?include_internal=true"
```
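If you do enable `include_internal`, you can still drop the internal nodes downstream by name when rendering a run's node list. The two node names come from this page; the filter itself is a convenience sketch:

```typescript
// Fabric's internal bookkeeping nodes, as documented above.
const INTERNAL_NODES = new Set(["_fabric_capture_input", "_fabric_finalize_output"]);

// Keep only user-defined workflow nodes for display.
function visibleNodes(nodeNames: string[]): string[] {
  return nodeNames.filter((n) => !INTERNAL_NODES.has(n));
}
```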

| Phase | Run Status | Events | SSE | Webhook |
| --- | --- | --- | --- | --- |
| Submit | `pending` | `run.created` | Yes | Yes |
| Queue | `pending` | `run.queued`, `run.promoted` | Yes | Yes |
| Execute | `running` | `run.started`, `node.*` | Yes | Yes |
| Complete | `completed` | `run.completed` (+ state) | Yes, then closes | Yes |
| Fail | `failed` | `run.failed` (+ state) | Yes, then closes | Yes |
| Cancel | `cancelled` | `run.cancelled` (+ state) | Yes, then closes | Yes |
| Pause | `paused` | `run.paused` | Yes | Yes |
| Wait | `waiting` | `run.waiting` | Yes | Yes |