Workflow Run Lifecycle
Every workflow run passes through a well-defined set of states, emitting domain events at each transition. This guide walks through the full lifecycle from submission to completion, explains when each event fires, and helps you choose the right observability mechanism.
For how to define workflows, see Workflow Authoring. For single-node jobs, see Submitting Jobs.
Overview
- Client calls `POST /v1/workflows/run` with `input` and an optional `state` object
- Fabric creates a tracking record and emits `workflow.run.created`
- The run enters the priority queue (`workflow.run.queued`)
- A worker claims the run and spawns N subprocesses — one per fan-out slot (`workflow.run.started`). Default N = 1; set `input.variants` (1–10) for parallel copies of the same workflow, or `input.bundle` for N different workflows in one submission.
- Nodes execute in DAG order, each emitting `workflow.node.started` / `workflow.node.completed` events
- When all subprocesses finish (or one fails under fail-fast), the run reaches a terminal status, the original `state` is echoed back, and per-variant outputs are aggregated into a uniform `outputs: [...]` array on the run row
All events are delivered through both SSE/WebSocket and webhooks in the same DomainEvent format.
Run State Machine
A workflow run progresses through these statuses:
| Status | Description |
|---|---|
| `pending` | Run record created, not yet executing |
| `running` | Executor is actively processing nodes |
| `completed` | All nodes completed successfully |
| `failed` | A node failed (or the run was marked failed during recovery) |
| `cancelled` | Cancelled by user, API call, or fail-fast |
| `paused` | Execution paused (resumable) |
| `waiting` | Blocked on external input (signal/approval) |
Status Transitions
| From | To | Trigger |
|---|---|---|
| `pending` | `running` | Worker spawns the subprocess |
| `running` | `completed` | All DAG paths resolved successfully |
| `running` | `failed` | A required path failed |
| `running` | `cancelled` | `POST .../cancel`, or fail-fast on node failure |
| `running` | `paused` | `POST .../pause` |
| `running` | `waiting` | Node requests external input |
| `paused` | `running` | `POST .../resume` |
| `waiting` | `running` | Signal received via `POST .../signal` or `POST .../approve` |
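The transition table can be encoded as a small client-side lookup, for example to validate optimistic UI updates before the server confirms them. This sketch is illustrative, not part of the SDK, and covers only the transitions documented above:

```typescript
type RunStatus =
  | "pending" | "running" | "completed" | "failed"
  | "cancelled" | "paused" | "waiting";

// One entry per row of the transition table; terminal statuses
// have no outgoing transitions.
const TRANSITIONS: Record<RunStatus, RunStatus[]> = {
  pending: ["running"],
  running: ["completed", "failed", "cancelled", "paused", "waiting"],
  paused: ["running"],
  waiting: ["running"],
  completed: [],
  failed: [],
  cancelled: [],
};

function canTransition(from: RunStatus, to: RunStatus): boolean {
  return TRANSITIONS[from].includes(to);
}

const isTerminal = (s: RunStatus): boolean => TRANSITIONS[s].length === 0;
```

A lookup like this also makes the terminal set explicit, which matters later when deciding when to stop polling or close a stream.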
Node State Machine
Each node (task) within a run has its own lifecycle:
| Status | Event | Description |
|---|---|---|
| `pending` | — | Node created, waiting for dependencies |
| `active` | `workflow.node.started` | Node is executing |
| — | `workflow.node.progress` | Intra-node progress (0+ times, via SDK `log.emit()`) |
| `completed` | `workflow.node.completed` | Node finished successfully |
| `failed` | `workflow.node.failed` | Node failed |
| `skipped` | `workflow.node.skipped` | Upstream dependency failed; node never ran |
Nodes with retry configuration will cycle back from `failed` to `active` until retries are exhausted. If a node’s `error` column is set, the event bridge always emits `workflow.node.failed` regardless of the textual status.
Event Timeline
Happy Path
For a linear workflow A → B → C, the event sequence is:
1. `workflow.run.created` — on `POST /v1/workflows/run`
2. `workflow.run.queued` — enters priority queue
3. `workflow.run.promoted` — worker picks it up
4. `workflow.run.started` — Python subprocess spawns
5. `workflow.node.started` (A)
6. `workflow.node.completed` (A)
7. `workflow.node.started` (B)
8. `workflow.node.progress` (B) — optional, 0+ times
9. `workflow.node.completed` (B)
10. `workflow.node.started` (C)
11. `workflow.node.completed` (C)
12. `workflow.run.completed` — includes `payload.state`
Failure (Fail-Fast)
By default, any node failure cancels the entire run. For a branching DAG where A → B and A → C, if B fails:
1. `workflow.node.started` (B)
2. `workflow.node.failed` (B)
3. `workflow.run.cancelled` — fail-fast triggers, remaining nodes are cancelled
4. `workflow.node.skipped` (any downstream of B)
The run reaches the `cancelled` status with a reason indicating which node failed.
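Which nodes end up skipped is a pure function of the DAG shape: everything downstream of the failed node. A minimal sketch (the `downstreamOf` helper is illustrative, not part of the SDK):

```typescript
// Given a DAG as an adjacency list, collect every node reachable
// from the failed node's children — these are the skipped nodes.
function downstreamOf(dag: Record<string, string[]>, failed: string): Set<string> {
  const skipped = new Set<string>();
  const stack = [...(dag[failed] ?? [])];
  while (stack.length > 0) {
    const node = stack.pop()!;
    if (skipped.has(node)) continue;
    skipped.add(node);
    stack.push(...(dag[node] ?? []));
  }
  return skipped;
}

// A → B, A → C, B → D: if B fails, only D is skipped.
// C is cancelled by fail-fast, not skipped, because it does not depend on B.
const dag: Record<string, string[]> = { A: ["B", "C"], B: ["D"], C: [], D: [] };
```

Note the distinction the event stream makes: nodes that depend on the failure are `skipped`, while unrelated in-flight branches are swept up by the run-level cancellation.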
To allow independent branches to continue after a failure, submit the run with `fail_fast: false`:
```typescript
const run = await fabric.workflows.runs.submit("research/deep-research", {
  input: { topic: "AI trends" },
  fail_fast: false, // allow partial completion
});
```

```python
run = fabric.submit_workflow(
    "research/deep-research",
    input={"topic": "AI trends"},
    fail_fast=False,  # allow partial completion
)
```

```shell
curl -X POST 'https://gofabric.dev/v1/workflows/run?name=research/deep-research' \
  -H "Authorization: Bearer fab_xxx" \
  -H "Content-Type: application/json" \
  -d '{ "input": { "topic": "AI trends" }, "fail_fast": false }'
```

With `fail_fast: false`, node C continues executing even if B fails. The run reaches `failed` only after all reachable paths resolve.
The state Parameter
The `state` field lets you attach opaque context to a run that’s echoed back on completion — no follow-up API call needed.
Lifecycle:
- Submit — include `state` in the request body (any JSON)
- Stored — persisted on the run record in the database
- Readable — returned in `GET /v1/workflow-runs/{id}` responses
- Echoed — included in `payload.state` on terminal events (`completed`, `failed`, `cancelled`)
- Streamed — visible in SSE and webhook deliveries
```typescript
// Submit with state
const run = await fabric.workflows.runs.submit("content/generate", {
  input: { topic: "quarterly trends" },
  state: { workspace_id: "ws-123", triggered_by: "user-456" },
});
```
```typescript
// Receive state in webhook
app.post("/webhook", async (req, res) => {
  const event = await constructWebhookEvent(req.body, req.headers["x-fabric-signature"]!, secret);
  if (event.kind === "workflow.run.completed") {
    const { workspace_id } = event.payload.state;
    // Route output to the correct workspace
  }
  res.sendStatus(200);
});
```

```shell
curl -X POST 'https://gofabric.dev/v1/workflows/run?name=content/generate' \
  -H "Authorization: Bearer fab_xxx" \
  -H "Content-Type: application/json" \
  -d '{ "input": { "topic": "quarterly trends" }, "state": { "workspace_id": "ws-123", "triggered_by": "user-456" } }'
```

Variants and Bundles — Parallel Fan-out
A run is N parallel subprocesses where N defaults to 1. Two ways to fan out:
| Knob | Shape | Use when |
|---|---|---|
| `input.variants: 3` | N copies of the same workflow with the same input | “Give me 3 different videos for the same brief” |
| `input.bundle: [...]` | N different workflows, each with its own input | “Give me 1 video + 1 carousel + 1 thread for the same brief” |
The two are mutually exclusive — set one or the other, not both.
The response shape is uniform regardless of N:
```json
{
  "run_id": "...",
  "status": "completed",
  "variants": 3,
  "outputs": [
    { "variant_index": 0, "workflow_name": "video/ai_shorts", "kind": "video", "output": {...}, "artifacts": [...] },
    { "variant_index": 1, "workflow_name": "video/ai_shorts", "kind": "video", "output": {...}, "artifacts": [...] },
    { "variant_index": 2, "workflow_name": "video/ai_shorts", "kind": "video", "output": {...}, "artifacts": [...] }
  ]
}
```

`outputs.length === variants`. Single-workflow runs have one entry; consumers iterate the array uniformly with no shape branching.
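Because the shape is uniform, consumer code needs no branching on N. A sketch, with minimal illustrative types standing in for the full response schema:

```typescript
// Trimmed-down shapes for the response above (illustrative only,
// not the complete API types).
interface VariantOutput {
  variant_index: number;
  workflow_name: string;
  kind: string;
  output: unknown;
  artifacts: unknown[];
}

interface RunResult {
  run_id: string;
  status: string;
  variants: number;
  outputs: VariantOutput[];
}

// Works identically whether variants is 1 or N.
function summarize(run: RunResult): string[] {
  return run.outputs.map(
    (o) => `#${o.variant_index} ${o.workflow_name} (${o.kind})`
  );
}
```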
`fail_fast` semantics (see Failure and Recovery) apply across the fan-out — any one variant failing fails the run by default; set `fail_fast: false` to let successful variants ship even when others fail.
See the TypeScript SDK reference and Python SDK reference for the `submitAndGetOutput` / `submit_and_get_output` helpers.
Regenerate — Variations of an Existing Run
Pass `input.regenerate` on submit to mark a run as a regeneration of an earlier run/variant. The server persists `parent_run_id` and `parent_variant_index` as run lineage columns. Workflows that support regeneration read `direction` (`punchier` / `deeper` / `contrarian` / `visual` / `data-first` / `surprise`), `keep` flags, and `extra_instructions` from `input.regenerate` and modulate their prompts; workflows that don’t simply ignore it.
```json
{
  "input": {
    "topic": "AI trends",
    "regenerate": {
      "direction": "punchier",
      "keep": ["platform", "tone_of_voice"],
      "parent_run_id": "...",
      "parent_variant_index": 0
    }
  }
}
```

This works orthogonally with `variants` and `bundle` — a regeneration can itself fan out into N variants.
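A small helper can derive the regeneration payload from a finished run. The field names follow the payload above; the `regenerateInput` helper itself is hypothetical, not an SDK function:

```typescript
// Options a workflow may read from input.regenerate (per this doc).
interface RegenerateOptions {
  direction: "punchier" | "deeper" | "contrarian" | "visual" | "data-first" | "surprise";
  keep?: string[];
  extra_instructions?: string;
}

// Build a new input that carries lineage back to the parent run/variant.
function regenerateInput(
  parentRunId: string,
  parentVariantIndex: number,
  baseInput: Record<string, unknown>,
  opts: RegenerateOptions
): Record<string, unknown> {
  return {
    ...baseInput,
    regenerate: {
      ...opts,
      parent_run_id: parentRunId,
      parent_variant_index: parentVariantIndex,
    },
  };
}
```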
Observing the Lifecycle
Fabric offers three mechanisms for tracking run progress. Choose based on your use case:
| Need | Mechanism | Why |
|---|---|---|
| Real-time UI progress bar | SSE via `useWorkflowRun` | Low latency, auto-closes on terminal event, replays on reconnect |
| Backend pipeline trigger | Webhook | Push-based, no open connection, retries built in |
| Poll from a cron job | `GET /v1/workflow-runs/{id}` | Simplest — no subscription setup |
| UI updates and backend processing | SSE + Webhook | SSE for the browser, webhook for your server |
| Audit log of all events | Webhook (all events) | Persistent delivery with retry guarantee |
Key behavioral differences:
- SSE auto-closes after terminal run events (`completed`, `failed`, `cancelled`). Webhooks fire independently for all matching events.
- SSE replays all past events on connect (catch-up). Webhooks are fire-and-forget with retries.
- SSE is per-connection. Webhooks are per-subscription (org-scoped, persist across connections).
- Both deliver the same `DomainEvent` structure.
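For polling consumers, the only Fabric-specific detail is the terminal status set. A sketch, where `fetchRun` stands in for any wrapper around `GET /v1/workflow-runs/{id}` (it is an assumed function, not part of the SDK):

```typescript
// Terminal run statuses per this guide's state machine.
const TERMINAL = new Set(["completed", "failed", "cancelled"]);

// Poll until the run reaches a terminal status.
async function pollUntilDone(
  fetchRun: (id: string) => Promise<{ status: string }>,
  runId: string,
  intervalMs = 5000
): Promise<{ status: string }> {
  for (;;) {
    const run = await fetchRun(runId);
    if (TERMINAL.has(run.status)) return run;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Note that `paused` and `waiting` are not terminal: a polling loop must keep going through them, since the run can resume.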
Failure and Recovery
Fail-Fast (Default)
When `fail_fast` is true (the default), any `workflow.node.failed` event triggers automatic cancellation of the run. This is the right behavior for most sequential pipelines — if script generation fails, there’s no point running voiceover or rendering.
Partial Completion
Set `fail_fast: false` when you have independent branches that should continue even if one fails. Use this for fan-out patterns, best-effort processing, or workflows where partial output is useful.
Stale Recovery
The event bridge periodically scans for stalled workflows:
- Stale threshold: 120 seconds of inactivity (no task transitions or heartbeats)
- Recovery interval: Every 60 seconds
- Max attempts: 3 recovery attempts before marking the run as failed
If a workflow is legitimately long-running (e.g., video rendering), the subprocess sends heartbeats every 30 seconds to prevent false recovery.
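The staleness rule reduces to a timestamp comparison. This sketch encodes the thresholds listed above (the real scanner runs server-side; the helper is illustrative):

```typescript
const STALE_THRESHOLD_MS = 120_000;   // 120 s of inactivity marks a run stale
const HEARTBEAT_INTERVAL_MS = 30_000; // subprocess heartbeat cadence

// A run is stale once inactivity exceeds the threshold.
function isStale(lastActivityAtMs: number, nowMs: number): boolean {
  return nowMs - lastActivityAtMs > STALE_THRESHOLD_MS;
}
```

The heartbeat interval (30 s) is deliberately well under the stale threshold (120 s), so a healthy long-running subprocess can miss several heartbeats before recovery kicks in.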
Subprocess Budget
Each workflow run requires a subprocess connection slot. If all slots are occupied, the run fails immediately with `"Subprocess connection budget exhausted"`. Monitor this error to scale capacity.
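One mitigation on the client side is to retry submission with exponential backoff when this specific error appears. The helper below is a sketch, not an SDK feature; `submit` is any async function that throws on failure:

```typescript
const BUDGET_ERROR = "Subprocess connection budget exhausted";

// 1 s, 2 s, 4 s, ... between attempts.
const backoffDelay = (attempt: number, baseMs = 1000): number => baseMs * 2 ** attempt;

async function submitWithBackoff<T>(
  submit: () => Promise<T>,
  maxAttempts = 4
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await submit();
    } catch (err) {
      const msg = err instanceof Error ? err.message : String(err);
      // Only the budget error is retryable here; anything else rethrows.
      if (!msg.includes(BUDGET_ERROR) || attempt + 1 >= maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
}
```

Backoff only masks transient contention; if the error is frequent, the fix is more capacity, not more retries.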
Cancellation and Pause/Resume
```typescript
// Cancel a running workflow
await fabric.workflows.runs.cancel(runId, { reason: "No longer needed" });

// Pause execution
await fabric.workflows.runs.pause(runId, { reason: "Waiting for approval" });

// Resume from paused state
await fabric.workflows.runs.resume(runId);

// Send a signal (resumes from waiting state)
await fabric.workflows.runs.signal(runId, {
  signal_name: "approval",
  payload: { approved: true },
});
```

```python
# Cancel
fabric.cancel_workflow_run(run_id, reason="No longer needed")

# Pause
fabric.pause_workflow_run(run_id, reason="Waiting for approval")

# Resume
fabric.resume_workflow_run(run_id)

# Signal
fabric.signal_workflow_run(run_id, signal_name="approval", payload={"approved": True})
```

```shell
# Cancel
curl -X POST "https://gofabric.dev/v1/workflow-runs/$RUN_ID/cancel" \
  -H "Authorization: Bearer fab_xxx" \
  -H "Content-Type: application/json" \
  -d '{"reason": "No longer needed"}'

# Pause
curl -X POST "https://gofabric.dev/v1/workflow-runs/$RUN_ID/pause" \
  -H "Authorization: Bearer fab_xxx" \
  -H "Content-Type: application/json" \
  -d '{"reason": "Waiting for approval"}'

# Resume
curl -X POST "https://gofabric.dev/v1/workflow-runs/$RUN_ID/resume" \
  -H "Authorization: Bearer fab_xxx"

# Signal
curl -X POST "https://gofabric.dev/v1/workflow-runs/$RUN_ID/signal" \
  -H "Authorization: Bearer fab_xxx" \
  -H "Content-Type: application/json" \
  -d '{"signal_name": "approval", "payload": {"approved": true}}'
```

Each operation emits a corresponding event (`workflow.run.cancelled`, `workflow.run.paused`, etc.) delivered via both SSE and webhooks. Cancellation echoes `payload.state` just like completion.
Internal Shim Nodes
The SDK inserts two internal nodes on every run:
- `_fabric_capture_input` — captures raw input at the start
- `_fabric_finalize_output` — collects artifacts and final output at the end
These are hidden from SSE by default to keep event streams clean. To include them (useful for debugging), add `?include_internal=true` to the SSE endpoint:
```shell
curl -N "https://gofabric.dev/v1/workflow-runs/$RUN_ID/events?include_internal=true"
```

Quick Reference
| Phase | Run Status | Events | SSE | Webhook |
|---|---|---|---|---|
| Submit | `pending` | `run.created` | Yes | Yes |
| Queue | `pending` | `run.queued`, `run.promoted` | Yes | Yes |
| Execute | `running` | `run.started`, `node.*` | Yes | Yes |
| Complete | `completed` | `run.completed` (+ `state`) | Yes, then closes | Yes |
| Fail | `failed` | `run.failed` (+ `state`) | Yes, then closes | Yes |
| Cancel | `cancelled` | `run.cancelled` (+ `state`) | Yes, then closes | Yes |
| Pause | `paused` | `run.paused` | Yes | Yes |
| Wait | `waiting` | `run.waiting` | Yes | Yes |