audio.transcribe
Operation: audio.transcribe
Category: media
Tags: audio, whisper, transcription, speech-to-text, candle
Transcribe audio/video to text with word-level timestamps via Whisper. Supports two backends: faster-whisper (Python, default) and Candle (Rust, built-in).
Runtime
- Type: Native (built-in)
- Timeout: 600s
- Retries: 2 (ExponentialWithJitter)
- Memory: 4096 MB
Inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| audio_url | Asset | No | — | URL or file:// path to an audio file |
| audio_path | String | No | — | Local filesystem path to an audio file |
| video_path | String | No | — | Local path to a video file (audio will be extracted) |
Outputs
| Name | Type | Description |
|---|---|---|
| text | String | Full transcript text |
| segments | Array<JSON> | Array of {start, end, text} segment objects |
| words | Array<JSON> | Array of {start, end, word} word-level timestamp objects |
| language | String | Detected language code |
| duration_secs | Number | Audio duration in seconds |
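
As an illustration of consuming the `segments` output, the following minimal sketch converts segment objects into SubRip (SRT) cues. It assumes `start` and `end` are expressed in seconds (consistent with `duration_secs`, but not stated explicitly for segments); the `Segment` type and both function names are hypothetical helpers, not part of Fabric.

```python
from typing import TypedDict


class Segment(TypedDict):
    # Assumed shape of one entry in the `segments` output.
    start: float  # assumed to be seconds
    end: float    # assumed to be seconds
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)
```

The same approach would work on the `words` array for word-by-word captions, substituting the `word` field for `text`.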
Default Configuration
```json
{
  "model": "large-v3",
  "prefer_local": true,
  "word_timestamps": true
}
```

Transcription Backends
Two backends are available for transcription. Both produce the same output format (segments + word-level timestamps), so downstream nodes like `generate.subtitles` work identically regardless of which backend is used.
faster-whisper (default)
The default backend. Runs the faster-whisper Python library in-process with CTranslate2 acceleration. Provides high-accuracy word-level timestamps via Dynamic Time Warping (DTW) cross-attention alignment.
- Model: `large-v3` (4.5 GB)
- Device: Auto-selects GPU (CUDA/Metal) or CPU
- Word timing: DTW cross-attention alignment (high precision)
- Dependencies: `faster-whisper` Python package
Candle (built-in Rust)
A fully native Rust backend using candle-transformers. It has no Python dependencies and runs entirely within the Fabric binary. Downloads the model from HuggingFace Hub on first use.
- Model: Any HuggingFace Whisper model (e.g., `openai/whisper-large-v3`)
- Device: CPU or Metal (via candle)
- Word timing: Timestamp token decoding with character-proportional interpolation
- Dependencies: None (built into Fabric with the `candle` feature)
- Long audio: Automatically chunks audio into 30-second segments (Whisper’s native window) and concatenates results with offset timestamps
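
The long-audio handling described in the last bullet can be sketched roughly as follows. This is an illustrative Python model of the idea, not the Candle backend's actual Rust implementation: the function names are hypothetical, and it assumes non-overlapping windows whose segment times are relative to the start of each chunk.

```python
CHUNK_SECS = 30.0  # Whisper's native window, as stated above


def chunk_bounds(duration_secs: float, chunk_secs: float = CHUNK_SECS) -> list[tuple[float, float]]:
    """Return (start, end) windows that cover the audio from 0 to duration_secs."""
    bounds = []
    start = 0.0
    while start < duration_secs:
        bounds.append((start, min(start + chunk_secs, duration_secs)))
        start += chunk_secs
    return bounds


def merge_chunks(chunk_results: list[tuple[float, list[dict]]]) -> list[dict]:
    """Shift each chunk's segment times by the chunk offset and concatenate.

    chunk_results pairs a chunk's start offset (seconds) with the segments
    transcribed from that chunk, whose times are relative to the chunk start.
    """
    merged = []
    for offset, segments in chunk_results:
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged
```

For a 75-second file, `chunk_bounds(75.0)` yields three windows, and a segment at 0.5s in the second window ends up at 30.5s in the merged output.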
To use the Candle backend, set `transcription_backend` to `"candle"` in the workflow input or stage call:

```python
from fabric_workflow_sdk.stages.captions import transcribe_audio

result = await transcribe_audio({
    "voiceover_path": "audio.mp3",
    "transcription_backend": "candle",  # use built-in Rust Whisper
})
# result["transcript"]: same format as faster-whisper output
# result["full_text"]: concatenated text
# result["duration"]: audio duration in seconds
```

The Candle backend requires the Fabric server to be running with the `candle` feature enabled and a Whisper model registered in the provider config. Set `FABRIC_PROVIDER_URL` if the server is not on the default `http://127.0.0.1:8080`.
Backend Comparison
| | faster-whisper | Candle |
|---|---|---|
| Language | Python (CTranslate2) | Rust (candle) |
| Word timing precision | High (DTW alignment) | Good (linear interpolation) |
| Device support | CPU, CUDA, Metal | CPU, Metal |
| Dependencies | faster-whisper package | None (built-in) |
| Best for | Production accuracy | Zero-dependency deploys, Rust-native pipelines |
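
To make the difference in word timing concrete, here is a minimal sketch of character-proportional interpolation, the approach listed for Candle. It is an assumption about the general technique rather than Fabric's actual code: each word in a segment receives a share of the segment's duration proportional to its character count, and `interpolate_words` is a hypothetical name.

```python
def interpolate_words(start: float, end: float, text: str) -> list[dict]:
    """Spread a segment's time span across its words by character count."""
    words = text.split()
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        return []  # empty or whitespace-only segment

    span = end - start
    result = []
    consumed = 0  # characters assigned to earlier words
    for w in words:
        w_start = start + span * consumed / total_chars
        consumed += len(w)
        w_end = start + span * consumed / total_chars
        result.append({"word": w, "start": w_start, "end": w_end})
    return result
```

Because the estimate depends only on text length, pauses and uneven speaking rates inside a segment are not reflected, which is why DTW alignment is rated as more precise.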
Provider API (Candle)
The Candle Whisper backend is also available directly via the provider API. Pass `word_timestamps: true` in `params` to get word-level timing:

```bash
curl -X POST http://127.0.0.1:8080/v1/providers/execute \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "candle",
    "model": "whisper",
    "modality": "audio",
    "input": { "audio": "<base64-encoded WAV>" },
    "params": { "word_timestamps": true }
  }'
```

Response:

```json
{
  "output": {
    "text": "Hello world, this is a test.",
    "words": [
      { "word": "Hello", "start": 0.0, "end": 0.32 },
      { "word": "world,", "start": 0.32, "end": 0.68 },
      { "word": "this", "start": 0.68, "end": 0.88 },
      { "word": "is", "start": 0.88, "end": 1.02 },
      { "word": "a", "start": 1.02, "end": 1.10 },
      { "word": "test.", "start": 1.10, "end": 1.44 }
    ],
    "duration": 1.5
  }
}
```

When `word_timestamps` is false (default), only `{"text": "..."}` is returned for backward compatibility.
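
The same request could be issued from Python using only the standard library. The sketch below mirrors the curl payload; `build_request` and `transcribe` are hypothetical helper names, and reading the base URL from `FABRIC_PROVIDER_URL` is an assumption carried over from the SDK note earlier rather than documented behavior of this endpoint.

```python
import base64
import json
import os
import urllib.request

DEFAULT_PROVIDER_URL = "http://127.0.0.1:8080"


def build_request(wav_bytes: bytes, word_timestamps: bool = True) -> urllib.request.Request:
    """Build the POST request for /v1/providers/execute without sending it."""
    # Assumption: the server address follows FABRIC_PROVIDER_URL when it is set.
    base_url = os.environ.get("FABRIC_PROVIDER_URL", DEFAULT_PROVIDER_URL).rstrip("/")
    payload = {
        "provider": "candle",
        "model": "whisper",
        "modality": "audio",
        "input": {"audio": base64.b64encode(wav_bytes).decode("ascii")},
        "params": {"word_timestamps": word_timestamps},
    }
    return urllib.request.Request(
        f"{base_url}/v1/providers/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def transcribe(wav_bytes: bytes) -> dict:
    """Send the request and return the "output" object from the response."""
    with urllib.request.urlopen(build_request(wav_bytes)) as resp:
        return json.load(resp)["output"]
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a running server.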
```typescript
import { WorkflowBuilder } from "@fabric-platform/sdk";

const workflow = new WorkflowBuilder("my-workflow")
  .node("audio-transcribe", "tool", (n) =>
    n.config({
      operation: "audio.transcribe",
      // ... node-specific config
    })
  )
  .build();
```

```python
from fabric_platform import FabricClient

fabric = FabricClient(api_key="fab_xxx")

wf_id = fabric.upsert_workflow("my-workflow", nodes=[
    {
        "key": "audio-transcribe",
        "kind": "tool",
    },
])
```

```rust
use fabric_sdk::FabricClient;

let client = FabricClient::new("https://gofabric.dev", api_key)?;

let wf_id = client.upsert_workflow("my-workflow", serde_json::json!({
    "nodes": [{ "key": "audio-transcribe", "kind": "tool" }]
})).await?;
```

```bash
curl -X POST https://gofabric.dev/v1/workflow-definitions \
  -H "Authorization: Bearer $FABRIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-workflow",
    "nodes": [{ "key": "audio-transcribe", "kind": "tool" }]
  }'
```