
audio.transcribe

Operation: audio.transcribe
Category: media
Tags: audio, whisper, transcription, speech-to-text, candle

Transcribe audio/video to text with word-level timestamps via Whisper. Supports two backends: faster-whisper (Python, default) and Candle (Rust, built-in).

Type: Native (built-in)
Timeout: 600s
Retries: 2 (ExponentialWithJitter)
Memory: 4096 MB
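The retry policy above is named ExponentialWithJitter, but the page does not give its parameters. As a rough illustration of what such a policy generally does, the sketch below computes one randomized delay per retry. The base delay, the cap, the "full jitter" variant, and the helper name `backoff_delays` are all assumptions made for illustration, not Fabric's actual settings.

```python
import random


def backoff_delays(retries=2, base_secs=1.0, cap_secs=30.0, rng=None):
    """Return one sleep duration per retry: exponential backoff with full jitter.

    base_secs and cap_secs are illustrative assumptions, not Fabric defaults.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(retries):
        # The ceiling doubles on every attempt, up to the cap.
        ceiling = min(cap_secs, base_secs * (2 ** attempt))
        # Full jitter: pick a uniform random delay in [0, ceiling].
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Two retries, matching the "Retries: 2" setting above.
delays = backoff_delays(retries=2, rng=random.Random(0))
```

Randomizing the delay spreads out retries from many failed jobs so they do not all hit the backend at the same moment.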

Inputs

| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| audio_url | Asset | No | | URL or file:// path to an audio file |
| audio_path | String | No | | Local filesystem path to an audio file |
| video_path | String | No | | Local path to a video file (audio will be extracted) |

Outputs

| Name | Type | Description |
|---|---|---|
| text | String | Full transcript text |
| segments | Array<JSON> | Array of {start, end, text} segment objects |
| words | Array<JSON> | Array of {start, end, word} word-level timestamp objects |
| language | String | Detected language code |
| duration_secs | Number | Audio duration in seconds |

{
  "model": "large-v3",
  "prefer_local": true,
  "word_timestamps": true
}

Two backends are available for transcription. Both produce the same output format (segments + word-level timestamps), so downstream nodes like generate.subtitles work identically regardless of which backend is used.
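Because both backends emit the same {start, end, text} segment objects, a downstream consumer only needs to understand that one shape. As a hedged illustration, the sketch below renders such segments as SRT cues. It is not the implementation of generate.subtitles; the function names `format_srt_time` and `segments_to_srt` are hypothetical, and the example data is made up.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render a list of {start, end, text} segments as numbered SRT cue blocks."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{format_srt_time(seg['start'])} --> {format_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


# Illustrative segments in the documented output shape.
example_segments = [
    {"start": 0.0, "end": 1.44, "text": "Hello world, this is a test."},
    {"start": 1.5, "end": 3.0, "text": "Second segment."},
]
srt = segments_to_srt(example_segments)
```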

faster-whisper is the default backend. It runs the faster-whisper Python library in-process with CTranslate2 acceleration and provides high-accuracy word-level timestamps via Dynamic Time Warping (DTW) alignment over the model's cross-attention weights.

  • Model: large-v3 (4.5 GB)
  • Device: Auto-selects GPU (CUDA/Metal) or CPU
  • Word timing: DTW cross-attention alignment (high precision)
  • Dependencies: faster-whisper Python package

Candle is a fully native Rust backend built on candle-transformers. It has no Python dependencies and runs entirely within the Fabric binary. The model is downloaded from the Hugging Face Hub on first use.

  • Model: Any HuggingFace Whisper model (e.g., openai/whisper-large-v3)
  • Device: CPU or Metal (via candle)
  • Word timing: Timestamp token decoding with character-proportional interpolation
  • Dependencies: None (built into Fabric with candle feature)
  • Long audio: Automatically chunks audio into 30-second segments (Whisper’s native window) and concatenates results with offset timestamps
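The long-audio handling described in the list above (30-second chunks, results concatenated with offset timestamps) can be sketched as follows. This is an illustration in Python of the general idea, not the Rust code in Fabric; the helper names `chunk_spans` and `merge_chunks` are hypothetical, and the sketch assumes each chunk's segments carry times relative to the start of that chunk.

```python
CHUNK_SECS = 30.0  # Whisper's native window, per the description above


def chunk_spans(duration_secs, chunk_secs=CHUNK_SECS):
    """Split [0, duration_secs) into consecutive (start, end) spans of at most chunk_secs."""
    spans = []
    start = 0.0
    while start < duration_secs:
        end = min(start + chunk_secs, duration_secs)
        spans.append((start, end))
        start = end
    return spans


def merge_chunks(chunk_results, chunk_secs=CHUNK_SECS):
    """Concatenate per-chunk segments, shifting chunk-relative times to absolute times."""
    merged = []
    for index, segments in enumerate(chunk_results):
        offset = index * chunk_secs  # where this chunk begins in the full audio
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged


spans = chunk_spans(70.0)
merged = merge_chunks([
    [{"start": 0.0, "end": 2.0, "text": "first chunk"}],
    [{"start": 1.0, "end": 4.5, "text": "second chunk"}],
])
```

A 70-second file yields three spans, the last one shorter than the window, and a segment that starts 1.0 s into the second chunk ends up at 31.0 s in the merged transcript.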

To use the Candle backend, set transcription_backend to "candle" in the workflow input or stage call:

import asyncio
from fabric_workflow_sdk.stages.captions import transcribe_audio

async def main():
    result = await transcribe_audio({
        "voiceover_path": "audio.mp3",
        "transcription_backend": "candle",  # use the built-in Rust Whisper backend
    })
    # result["transcript"]: same format as faster-whisper output
    # result["full_text"]: concatenated text
    # result["duration"]: audio duration in seconds
    return result

asyncio.run(main())

The Candle backend requires the Fabric server to be running with the candle feature enabled and a Whisper model registered in the provider config. Set FABRIC_PROVIDER_URL if the server is not on the default http://127.0.0.1:8080.

| | faster-whisper | Candle |
|---|---|---|
| Language | Python (CTranslate2) | Rust (candle) |
| Word timing precision | High (DTW alignment) | Good (linear interpolation) |
| GPU support | CUDA, Metal | CPU, Metal |
| Dependencies | faster-whisper package | None (built-in) |
| Best for | Production accuracy | Zero-dependency deploys, Rust-native pipelines |
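To make the Candle word-timing approach more concrete: timestamp tokens give a start and end for each segment, and word times inside the segment are then estimated in proportion to character counts. The sketch below shows that idea in Python under simple assumptions (whitespace word splitting, spaces not counted). The function name `interpolate_word_times` is hypothetical and the real Rust implementation may differ in detail.

```python
def interpolate_word_times(text, start, end):
    """Spread [start, end] across the words of text in proportion to word length."""
    words = text.split()
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        return []
    timed = []
    cursor = start
    duration = end - start
    for w in words:
        # Each word receives a share of the segment equal to its share of characters.
        word_end = cursor + duration * len(w) / total_chars
        timed.append({"word": w, "start": cursor, "end": word_end})
        cursor = word_end
    # Pin the final boundary to the segment end to avoid floating-point drift.
    timed[-1]["end"] = end
    return timed


timed_words = interpolate_word_times("ab cdef", 0.0, 3.0)
```

This explains the precision gap in the comparison: the estimate assumes a constant speaking rate within a segment, whereas DTW alignment follows where the model actually attended for each token.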

The Candle Whisper backend is also available directly via the provider API. Pass word_timestamps: true in params to get word-level timing:

curl -X POST http://127.0.0.1:8080/v1/providers/execute \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "candle",
    "model": "whisper",
    "modality": "audio",
    "input": { "audio": "<base64-encoded WAV>" },
    "params": { "word_timestamps": true }
  }'

Response:

{
  "output": {
    "text": "Hello world, this is a test.",
    "words": [
      { "word": "Hello", "start": 0.0, "end": 0.32 },
      { "word": "world,", "start": 0.32, "end": 0.68 },
      { "word": "this", "start": 0.68, "end": 0.88 },
      { "word": "is", "start": 0.88, "end": 1.02 },
      { "word": "a", "start": 1.02, "end": 1.10 },
      { "word": "test.", "start": 1.10, "end": 1.44 }
    ],
    "duration": 1.5
  }
}

When word_timestamps is false (default), only {"text": "..."} is returned for backward compatibility.
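For callers that prefer Python over curl, the same request can be assembled with the standard library. The sketch below mirrors the request and response shapes shown above and makes a few assumptions that the page does not confirm: that FABRIC_PROVIDER_URL is also the right base URL for the provider API, and that the text-only reply may arrive either nested under "output" or as a bare object. The helper names `build_request` and `parse_output` are hypothetical.

```python
import base64
import json
import os
import urllib.request

DEFAULT_URL = "http://127.0.0.1:8080"


def build_request(wav_bytes, word_timestamps=True, base_url=None):
    """Build a POST request for /v1/providers/execute with base64-encoded WAV audio."""
    # Assumption: FABRIC_PROVIDER_URL overrides the default server address here too.
    base_url = base_url or os.environ.get("FABRIC_PROVIDER_URL", DEFAULT_URL)
    payload = {
        "provider": "candle",
        "model": "whisper",
        "modality": "audio",
        "input": {"audio": base64.b64encode(wav_bytes).decode("ascii")},
        "params": {"word_timestamps": word_timestamps},
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/providers/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_output(response_body):
    """Return (text, words); words is an empty list when timestamps were not returned."""
    body = json.loads(response_body)
    # Assumption: accept both {"output": {...}} and a bare {"text": "..."} object.
    output = body.get("output", body)
    return output["text"], output.get("words", [])
```

Sending the request is then `with urllib.request.urlopen(build_request(wav_bytes)) as resp: text, words = parse_output(resp.read())`, which requires a running Fabric server with the Candle backend enabled.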

import { WorkflowBuilder } from "@fabric-platform/sdk";

const workflow = new WorkflowBuilder("my-workflow")
  .node("audio-transcribe", "tool", (n) =>
    n.config({
      operation: "audio.transcribe",
      // ... node-specific config
    })
  )
  .build();