audio.transcribe

Operation: audio.transcribe
Category: media
Tags: audio, whisper, transcription, speech-to-text, candle

Transcribe audio/video to text with word-level timestamps via Whisper. Supports two backends: faster-whisper (Python, default) and Candle (Rust, built-in).

Runtime

Type: Native (built-in)
Timeout: 600s
Retries: 2 (ExponentialWithJitter)
Memory: 4096 MB

Inputs

Name	Type	Required	Default	Description
`audio_url`	Asset	No	—	URL or file:// path to an audio file
`audio_path`	String	No	—	Local filesystem path to an audio file
`video_path`	String	No	—	Local path to a video file (audio will be extracted)
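
None of the three inputs is marked required, so the caller decides which one to populate. The sketch below shows one way to make that choice. It assumes that a single source is supplied per call, which the table does not state. `build_inputs` and the `VIDEO_EXTENSIONS` list are hypothetical helpers, not part of Fabric.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical list of container formats treated as video; adjust as needed.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".webm"}


def build_inputs(source: str) -> dict[str, str]:
    """Map one media source onto one of the operation's three input names."""
    scheme = urlparse(source).scheme
    if scheme in {"http", "https", "file"}:
        # audio_url accepts a URL or a file:// path.
        return {"audio_url": source}
    if PurePosixPath(source).suffix.lower() in VIDEO_EXTENSIONS:
        # Video files go to video_path so the audio track is extracted.
        return {"video_path": source}
    return {"audio_path": source}
```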

Outputs

Name	Type	Description
`text`	String	Full transcript text
`segments`	Array<JSON>	Array of {start, end, text} segment objects
`words`	Array<JSON>	Array of {start, end, word} word-level timestamp objects
`language`	String	Detected language code
`duration_secs`	Number	Audio duration in seconds
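
To illustrate the shape of the `words` output, the sketch below groups word objects into SRT subtitle cues. It is not how generate.subtitles is implemented. `format_srt_time`, `words_to_srt`, and the four-words-per-cue default are assumptions made for the example.

```python
def format_srt_time(seconds: float) -> str:
    """Render seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 4) -> str:
    """Group {start, end, word} objects into numbered SRT cues."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start = format_srt_time(group[0]["start"])
        end = format_srt_time(group[-1]["end"])
        text = " ".join(w["word"].strip() for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues) + "\n" if cues else ""


# Sample data in the documented {start, end, word} shape.
sample_words = [
    {"word": "Hello", "start": 0.0, "end": 0.32},
    {"word": "world,", "start": 0.32, "end": 0.68},
    {"word": "this", "start": 0.68, "end": 0.88},
    {"word": "is", "start": 0.88, "end": 1.02},
    {"word": "a", "start": 1.02, "end": 1.10},
    {"word": "test.", "start": 1.10, "end": 1.44},
]
srt_text = words_to_srt(sample_words)
```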

Default Configuration

{
  "model": "large-v3",
  "prefer_local": true,
  "word_timestamps": true
}
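
A minimal sketch of how node-level overrides could be layered over these defaults. It assumes a shallow merge in which caller-supplied keys win. The actual merge rules used by Fabric are not described here, and `resolve_config` is a hypothetical helper.

```python
from typing import Any, Optional

# Defaults copied from the configuration above.
DEFAULT_CONFIG: dict[str, Any] = {
    "model": "large-v3",
    "prefer_local": True,
    "word_timestamps": True,
}


def resolve_config(overrides: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Return the defaults with any caller-supplied keys layered on top."""
    return {**DEFAULT_CONFIG, **(overrides or {})}
```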

Transcription Backends

Two backends are available for transcription. Both produce the same output format (segments + word-level timestamps), so downstream nodes like generate.subtitles work identically regardless of which backend is used.

faster-whisper (default)

The default backend. Runs the faster-whisper Python library in-process with CTranslate2 acceleration. Provides high-accuracy word-level timestamps via Dynamic Time Warping (DTW) cross-attention alignment.

Model: large-v3 (4.5 GB)
Device: Auto-selects GPU (CUDA/Metal) or CPU
Word timing: DTW cross-attention alignment (high precision)
Dependencies: faster-whisper Python package

Candle (built-in Rust)

A fully native Rust backend using candle-transformers. Zero Python dependencies — runs entirely within the Fabric binary. Downloads the model from HuggingFace Hub on first use.

Model: Any HuggingFace Whisper model (e.g., openai/whisper-large-v3)
Device: CPU or Metal (via candle)
Word timing: Timestamp token decoding with character-proportional interpolation
Dependencies: None (built into Fabric with candle feature)
Long audio: Automatically chunks audio into 30-second segments (Whisper’s native window) and concatenates results with offset timestamps
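
The long-audio behavior can be pictured as follows: split the duration into 30-second windows, transcribe each window independently, then shift each window's timestamps by its start offset. This is a sketch of the idea only. `chunk_bounds` and `merge_chunks` are hypothetical names, and the Candle backend's handling of chunk boundaries may differ.

```python
CHUNK_SECONDS = 30.0  # Whisper's native window


def chunk_bounds(duration: float, chunk: float = CHUNK_SECONDS) -> list[tuple[float, float]]:
    """Split [0, duration) into consecutive windows of at most `chunk` seconds."""
    bounds = []
    start = 0.0
    while start < duration:
        bounds.append((start, min(start + chunk, duration)))
        start += chunk
    return bounds


def merge_chunks(chunks: list[tuple[float, list[dict]]]) -> list[dict]:
    """Concatenate per-chunk segments, shifting times by each chunk's offset."""
    merged = []
    for offset, segments in chunks:
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,  # chunk-relative -> absolute
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged
```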

To use the Candle backend, set transcription_backend to "candle" in the workflow input or stage call:

import asyncio

from fabric_workflow_sdk.stages.captions import transcribe_audio

async def main() -> None:
    # transcribe_audio is a coroutine, so it must be awaited inside an async function.
    result = await transcribe_audio({
        "voiceover_path": "audio.mp3",
        "transcription_backend": "candle",  # use built-in Rust Whisper
    })
    # result["transcript"]: same format as faster-whisper output
    # result["full_text"]: concatenated text
    # result["duration"]: audio duration in seconds
    print(result["full_text"])

asyncio.run(main())

The Candle backend requires the Fabric server to be running with the candle feature enabled and a Whisper model registered in the provider config. Set FABRIC_PROVIDER_URL if the server is not on the default http://127.0.0.1:8080.
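
For client code that calls the server directly, the same lookup can be mirrored as below. This is a sketch of the documented behavior (environment variable first, then the default address), not the SDK's internal code. `provider_url` is a hypothetical helper.

```python
import os

DEFAULT_PROVIDER_URL = "http://127.0.0.1:8080"


def provider_url(env=None) -> str:
    """Return FABRIC_PROVIDER_URL if set and non-empty, else the default."""
    env = os.environ if env is None else env
    url = env.get("FABRIC_PROVIDER_URL") or DEFAULT_PROVIDER_URL
    # Drop a trailing slash so paths can be appended safely.
    return url.rstrip("/")
```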

Backend Comparison

	faster-whisper	Candle
Language	Python (CTranslate2)	Rust (candle)
Word timing precision	High (DTW alignment)	Good (character-proportional interpolation)
Device support	CPU, CUDA, Metal	CPU, Metal
Dependencies	`faster-whisper` package	None (built-in)
Best for	Production accuracy	Zero-dependency deploys, Rust-native pipelines
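
The precision gap comes from how word boundaries are derived. Character-proportional interpolation divides a segment's time span among its words according to their length, which assumes an even speaking rate within the segment. A minimal sketch of the idea follows. `interpolate_words` is a hypothetical name, and the Candle backend's treatment of spaces and punctuation may differ.

```python
def interpolate_words(text: str, start: float, end: float) -> list[dict]:
    """Spread a segment's time span across its words by character count."""
    tokens = text.split()
    total_chars = sum(len(t) for t in tokens)
    if total_chars == 0:
        return []
    words = []
    cursor = start
    span = end - start
    for token in tokens:
        # Each word receives a share of the span proportional to its length.
        word_end = cursor + span * len(token) / total_chars
        words.append({"word": token, "start": cursor, "end": word_end})
        cursor = word_end
    # Pin the final boundary to the segment end to avoid float drift.
    words[-1]["end"] = end
    return words
```

A long pause inside a segment is spread across every word, whereas DTW alignment can place it between the words where it occurred.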

Provider API (Candle)

The Candle Whisper backend is also available directly via the provider API. Pass word_timestamps: true in params to get word-level timing:

curl -X POST http://127.0.0.1:8080/v1/providers/execute \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "candle",
    "model": "whisper",
    "modality": "audio",
    "input": { "audio": "<base64-encoded WAV>" },
    "params": { "word_timestamps": true }
  }'

Response:

{
  "output": {
    "text": "Hello world, this is a test.",
    "words": [
      { "word": "Hello", "start": 0.0, "end": 0.32 },
      { "word": "world,", "start": 0.32, "end": 0.68 },
      { "word": "this", "start": 0.68, "end": 0.88 },
      { "word": "is", "start": 0.88, "end": 1.02 },
      { "word": "a", "start": 1.02, "end": 1.10 },
      { "word": "test.", "start": 1.10, "end": 1.44 }
    ],
    "duration": 1.5
  }
}

When word_timestamps is false (the default for this endpoint), the response contains only {"text": "..."} for backward compatibility.
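
The same request can be built from Python with the standard library alone. The sketch below constructs the payload shown in the curl example and reads both response shapes. `build_request` and `extract_words` are hypothetical helpers. Whether the text-only response is wrapped in `output` is not specified above, so the reader accepts either form.

```python
import base64
import json
import urllib.request


def build_request(wav_bytes: bytes, word_timestamps: bool = True,
                  base_url: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    """Build the POST request for /v1/providers/execute."""
    payload = {
        "provider": "candle",
        "model": "whisper",
        "modality": "audio",
        "input": {"audio": base64.b64encode(wav_bytes).decode("ascii")},
        "params": {"word_timestamps": word_timestamps},
    }
    return urllib.request.Request(
        f"{base_url}/v1/providers/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_words(response: dict) -> list[dict]:
    """Return word timings, or [] for the text-only response shape."""
    # Assumption: tolerate the body with or without the "output" wrapper.
    return response.get("output", response).get("words", [])


# Usage against a running server (not executed here):
#   with open("audio.wav", "rb") as f:
#       req = build_request(f.read())
#   with urllib.request.urlopen(req) as resp:
#       words = extract_words(json.load(resp))
```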

Usage

TypeScript:

import { WorkflowBuilder } from "@fabric-platform/sdk";

const workflow = new WorkflowBuilder("my-workflow")
  .node("audio-transcribe", "tool", (n) =>
    n.config({
      operation: "audio.transcribe",
      // ... node-specific config
    })
  )
  .build();

Python:

from fabric_platform import FabricClient

fabric = FabricClient(api_key="fab_xxx")

wf_id = fabric.upsert_workflow("my-workflow", nodes=[
    {
        "key": "audio-transcribe",
        "kind": "tool",
    },
])

Rust:

use fabric_sdk::FabricClient;

let client = FabricClient::new("https://gofabric.dev", api_key)?;

let wf_id = client.upsert_workflow("my-workflow", serde_json::json!({
    "nodes": [{
        "key": "audio-transcribe",
        "kind": "tool"
    }]
})).await?;

curl:

curl -X POST https://gofabric.dev/v1/workflow-definitions \
  -H "Authorization: Bearer $FABRIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-workflow",
    "nodes": [{
      "key": "audio-transcribe",
      "kind": "tool"
    }]
  }'