Use this guide when you already have a Dictation session and need to stream live audio to GET /ws/transcribe for real-time transcription.
The WebSocket is only for the live audio stream and the real-time transcript frames that come back from the server. You still use the Dictation REST APIs to create the session before streaming, end the session after streaming, and retrieve final or cumulative results.
For endpoint details and code examples, refer to Stream audio to dictation session. For the full REST workflow, refer to Audio dictation.
How Dictation streaming works
Suki for Partners dictation streaming works as follows:
- Create or reuse a Dictation session.
- Open a WebSocket connection to GET /ws/transcribe.
- Send one JSON message per audio chunk.
- Send an explicit end-of-audio message when the user stops speaking.
- Read partial and final transcript messages from the socket.
- Close the socket, then use End dictation session to end the Dictation session and retrieve results.
Before you connect (Prerequisites)
Open GET /ws/transcribe only when the dictation session is READY or IDLE.
If the session is RUNNING because another audio stream is active, COMPLETED, or in another state that cannot accept speech, the WebSocket handshake fails. The server returns FailedPrecondition with a message such as transcript session is not accepting new speech sessions.
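The state gate described above can be sketched as a small guard that runs before each connection attempt. This is a minimal sketch, not the official client: `get_state` is a hypothetical callable that you would back with your own session status lookup, and the timeout and polling interval are arbitrary illustrative values.

```python
import time
from typing import Callable

# States in which this guide says the WebSocket may be opened.
CONNECTABLE_STATES = {"READY", "IDLE"}


def can_open_stream(state: str) -> bool:
    """Return True when the session can accept a new speech session."""
    return state in CONNECTABLE_STATES


def wait_until_connectable(
    get_state: Callable[[], str],  # hypothetical: your own status lookup
    timeout_s: float = 10.0,       # illustrative value, not from the API
    poll_interval_s: float = 0.25, # illustrative value, not from the API
) -> bool:
    """Poll until the session is READY or IDLE, or until the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if can_open_stream(get_state()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

Call `wait_until_connectable` before each new utterance and open the WebSocket only if it returns `True`.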
One dictation session can support multiple speech sessions over time, such as push-to-talk. After you send AUDIO_END and the server finishes processing that stream, wait until the session returns to READY or IDLE before opening another WebSocket for the next utterance.
Send JSON text frames
Every message you send on /ws/transcribe must be a UTF-8 JSON text frame.
- Each WebSocket frame must contain exactly one JSON object.
- Each client send should contain one logical message.
- Audio bytes go inside a JSON string field, not in a binary WebSocket frame.
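As a client-side sanity check for the one-object-per-frame rule, a sketch like the following can run before each send. The helper name `is_single_json_object` is illustrative and is not part of any Suki SDK.

```python
import json


def is_single_json_object(frame_text: str) -> bool:
    """Return True if the text parses as exactly one JSON object."""
    try:
        parsed = json.loads(frame_text)
    except json.JSONDecodeError:
        # Two concatenated objects, or any malformed text, fail to parse.
        return False
    return isinstance(parsed, dict)
```

Two objects concatenated into one string fail this check because the parser rejects the trailing data.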
Message format
Each outbound message is a JSON object with a type field. For audio chunks, the payload field is audioData.
Audio chunks
Send audio with type set to AUDIO. The audioData value must be:
- Standard Base64 (RFC 4648)
- An encoding of the raw PCM_S16LE bytes you intend to send
- Sent as a JSON string, regardless of the programming language you use
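The AUDIO message described above can be built as follows. This is a minimal sketch that uses only the `type` and `audioData` fields named in this guide; the function name is illustrative.

```python
import base64
import json


def build_audio_message(pcm_chunk: bytes) -> str:
    """Return one JSON text frame carrying a Base64-encoded PCM_S16LE chunk."""
    return json.dumps({
        "type": "AUDIO",
        # b64encode uses the standard RFC 4648 alphabet ("+" and "/").
        "audioData": base64.b64encode(pcm_chunk).decode("ascii"),
    })


# Two 16-bit little-endian samples (values 0 and 1) as a tiny example payload.
frame = build_audio_message(b"\x00\x00\x01\x00")
```

Send the returned string as a single WebSocket text frame.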
End-of-audio message
When you finish sending audio on the WebSocket, send one EVENT message with event set to AUDIO_END.
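A sketch of the end-of-audio frame, using the same field names that the Troubleshooting section of this guide shows for AUDIO_END. The function name is illustrative.

```python
import json


def build_audio_end_message() -> str:
    """Return the JSON text frame that marks the end of audio."""
    return json.dumps({"type": "EVENT", "event": "AUDIO_END"})
```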
Dictation /ws/transcribe does not use START_TIME, does not use ambient-style AUDIO messages with a data field, and does not use the ambient end marker RU9G. Use audioData for chunks and EVENT with event: AUDIO_END when you are done sending audio.
Required message order
For each logical stream of audio on the socket:
- Send one or more AUDIO messages. Each message includes one audioData chunk.
- Send one EVENT message with event: AUDIO_END after the last audio chunk you intend to send on that connection.
There is no START_TIME step and no ambient RU9G end marker on this endpoint.
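The ordering rule above can be checked in a unit test before anything reaches the socket. This is a sketch under the assumption that you collect the outbound frames as JSON strings; `follows_required_order` is an illustrative helper, not a Suki API.

```python
import json
from typing import Iterable


def follows_required_order(frames: Iterable[str]) -> bool:
    """True if frames are one or more AUDIO messages followed by one AUDIO_END."""
    messages = [json.loads(frame) for frame in frames]
    if len(messages) < 2:
        return False  # need at least one AUDIO plus the AUDIO_END event
    *audio, last = messages
    all_audio = all(
        m.get("type") == "AUDIO" and "audioData" in m for m in audio
    )
    ends_correctly = last.get("type") == "EVENT" and last.get("event") == "AUDIO_END"
    return all_audio and ends_correctly
```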
Read transcript frames
For each upstream ASR result, the gateway sends a JSON text frame on the WebSocket. Parse event.data in onmessage, or use the equivalent message handler in your WebSocket stack.
For code samples, see Stream audio to dictation session.
Partial and final transcripts
Use is_final to decide whether the text should be shown as draft text or committed to the transcript.
| is_final | Meaning |
|---|---|
| false | Partial, or interim, transcript. The text may change in later messages as the recognizer refines its result. |
| true | Final transcript for that segment. The recognizer has committed to this text, and it will not be revised. |
End of results
After the upstream stream ends, the server sends one terminal frame. Treat transcript.transcript equal to EOF as the canonical end-of-results signal for that speech session. The WebSocket closes shortly after.
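A sketch of the end-of-results check, assuming only what this section states: the terminal frame is a JSON object whose `transcript` object has a `transcript` field equal to `EOF`. No other fields of the frame are assumed, and the function name is illustrative.

```python
import json


def is_end_of_results(frame_text: str) -> bool:
    """Return True for the terminal frame (transcript.transcript == "EOF")."""
    frame = json.loads(frame_text)
    transcript = frame.get("transcript")
    return isinstance(transcript, dict) and transcript.get("transcript") == "EOF"
```

Stop reading transcript frames for that speech session once this returns `True`.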
Handle transcript frames in your UI
- Empty transcripts are dropped server-side. You only receive frames where transcript.transcript is non-empty, except the terminal EOF frame.
- words is most useful on final frames. Partial frames typically include an empty words array.
- transcript_id is generated for each emission, including partial frames. Two messages for the same spoken utterance can have different IDs. Do not dedupe by transcript_id. Use it to order messages and to distinguish separate final segments from each other.
- Append or replace UI text based on is_final. Treat partials as draft text that may change, and commit finals to your transcript buffer.
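The draft-versus-committed handling above can be sketched as a small buffer. The class takes the transcript text and the `is_final` flag as plain arguments, so it makes no assumption about where those fields sit in the frame. `TranscriptBuffer` is an illustrative name, and joining segments with a space is an assumption about how you want to display them.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptBuffer:
    """Keeps committed final segments plus the current draft (partial) text."""
    committed: list[str] = field(default_factory=list)
    draft: str = ""

    def apply(self, text: str, is_final: bool) -> None:
        if is_final:
            # Finals are committed and replace whatever draft was showing.
            self.committed.append(text)
            self.draft = ""
        else:
            # Partials overwrite the draft; they may change in later frames.
            self.draft = text

    def display_text(self) -> str:
        parts = self.committed + ([self.draft] if self.draft else [])
        return " ".join(parts)
```

Each partial replaces the previous draft, so the UI never accumulates stale interim text.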
Complete the session after streaming
After you send AUDIO_END and finish reading results from the WebSocket, close the connection and complete the session with REST.
Close the WebSocket Connection
Close the WebSocket connection from the client when you are done sending audio on that connection.
End the Dictation Session Using REST
End the transcription session using End dictation session.
Retrieve Results Using REST
Follow Audio dictation to retrieve final or cumulative transcripts and clean up the session. Use the same transcription_session_id and sdp_suki_token patterns as the rest of the Dictation APIs.
End the session and retrieve transcripts using the REST flows linked in the steps above. The outbound WebSocket contract in this guide does not replace those APIs.
Audio format and chunking
Each audioData value must decode from Base64 to raw PCM_S16LE audio.
PCM vs WAV
PCM_S16LE is raw audio data. .wav is a container format and usually includes a header before the audio data.
If your source is WAV, skip the header before chunking (44 bytes for a canonical PCM WAV file; files with extra chunks have longer headers), or decode the file to raw PCM before sending. Use 0 as the offset if your buffer is already raw PCM. Sending WAV headers as PCM reduces recognition quality and makes debugging harder.
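One way to avoid hard-coding a header length is to let a WAV parser locate the sample data. The sketch below uses Python's standard `wave` module and assumes the file holds uncompressed 16-bit PCM. The function name is illustrative.

```python
import io
import wave


def wav_bytes_to_pcm(wav_bytes: bytes) -> bytes:
    """Decode an in-memory PCM WAV file and return only the raw sample bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples for PCM_S16LE")
        # readframes skips the header and returns the sample data only.
        return wav.readframes(wav.getnframes())
```

The returned bytes can then be chunked and Base64-encoded like any other raw PCM buffer.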
Recommended audio format
- Encoding: PCM_S16LE (signed 16-bit little-endian PCM), the same family as LINEAR16 at 16 kHz mono in typical capture pipelines
- Channels: Mono
- Sample rate: 16 kHz, aligned with what your integration expects
Chunk size
Send audio in small chunks during streaming. For 16 kHz, mono, 16-bit audio, about 3200 bytes per chunk is a common choice, which is about 100 ms per message. Size chunks to your capture pipeline if your encoder differs.
Encode each chunk
For every AUDIO message:
- Take raw PCM_S16LE bytes.
- Encode the bytes using standard Base64 (RFC 4648).
- Send the encoded string as audioData.
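The 3200-byte figure follows from the audio format: 16,000 samples per second × 2 bytes per sample × 1 channel × 0.1 s. The sketch below computes that size and applies the encoding steps above to each chunk. Function names and default values are illustrative.

```python
import base64
from typing import Iterator


def chunk_size_bytes(sample_rate_hz: int = 16000, channels: int = 1,
                     bytes_per_sample: int = 2, chunk_ms: int = 100) -> int:
    """Bytes of PCM audio that cover chunk_ms milliseconds."""
    return sample_rate_hz * channels * bytes_per_sample * chunk_ms // 1000


def iter_encoded_chunks(pcm: bytes, size: int) -> Iterator[str]:
    """Split raw PCM into fixed-size chunks and Base64-encode each one."""
    for start in range(0, len(pcm), size):
        chunk = pcm[start:start + size]  # the last chunk may be shorter
        yield base64.b64encode(chunk).decode("ascii")
```

Keep the chunk size a multiple of the frame size (2 bytes for 16-bit mono) so that no sample is split across two messages.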
Example flow
This example shows the outbound message sequence for one stream: multiple AUDIO chunks, followed by AUDIO_END.
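A minimal sketch of that sequence as a generator of JSON text frames. It covers only the outbound payloads. Opening the socket, authentication, and reading transcripts are left to your WebSocket stack. The function name and the 3200-byte default are illustrative.

```python
import base64
import json
from typing import Iterator

CHUNK_BYTES = 3200  # about 100 ms of 16 kHz mono 16-bit audio


def outbound_frames(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[str]:
    """Yield AUDIO frames for each chunk, then a single AUDIO_END event."""
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "AUDIO",
            "audioData": base64.b64encode(chunk).decode("ascii"),
        })
    # Sent once, after the last audio chunk on this connection.
    yield json.dumps({"type": "EVENT", "event": "AUDIO_END"})
```

For example, 8000 bytes of PCM (250 ms at this format) produce three AUDIO frames followed by one AUDIO_END frame. Send each yielded string as its own WebSocket text frame.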
Troubleshooting
Sending WAV instead of PCM
Symptom: Poor transcription or failures
Fix: Strip the WAV header, for example 44 bytes, or decode to raw PCM_S16LE before Base64 encoding.
Incorrect Base64 encoding
Symptom: Server parse errors
Fix: Use standard Base64 (RFC 4648), not URL-safe Base64 or hex.
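To see the difference in practice, the sketch below encodes the same two bytes with both alphabets and checks a payload against the standard one. `is_standard_base64` is an illustrative helper for local debugging, not a server-side rule.

```python
import base64
import binascii


def is_standard_base64(value: str) -> bool:
    """True if value decodes under the standard RFC 4648 alphabet."""
    try:
        base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


sample = b"\xfb\xff"
standard = base64.b64encode(sample).decode("ascii")          # uses "+" and "/"
url_safe = base64.urlsafe_b64encode(sample).decode("ascii")  # uses "-" and "_"
```

This check catches URL-safe output that contains `-` or `_`. It cannot catch hex strings, because hex digits are also valid Base64 characters.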
Sending multiple JSON objects in one frame
Symptom: Parsing errors
Fix: Send one JSON object per WebSocket frame.
Missing or incorrect AUDIO_END
Symptom: Stream does not finalize, or the server does not know audio is complete
Fix: Send { "type": "EVENT", "event": "AUDIO_END" } after your last AUDIO message.
Using data instead of audioData for AUDIO messages
Symptom: Ignored or invalid audio payloads
Fix: Use audioData for Base64 PCM_S16LE chunks on /ws/transcribe.
Wrong Sec-WebSocket-Protocol in the browser
Symptom: 401 on handshake or immediate disconnect
Fix: Use SukiAmbientAuth,<sdp_suki_token>,<transcription_session_id>, not SukiTranscriptionAuth, and not the ambient order (session ID, then token).
Skipping REST after streaming
Symptom: Missing final transcript or session left open
Fix: Always call End dictation session to ensure the session is closed and the transcript is generated.
WebSocket fails to open with FailedPrecondition
Symptom: Handshake fails with transcript session is not accepting new speech sessions
Fix: Open /ws/transcribe only when the dictation session is READY or IDLE. Before you connect again, wait until the previous speech session on that transcription_session_id has finished: you sent AUDIO_END, received EOF, and the session returned to READY or IDLE.
UI flicker or missing word-level detail
Symptom: Partial text jumps unpredictably, or words is always empty
Fix: Update the UI from is_final: false frames as draft text. Use words on is_final: true frames for word-level or speaker-aware display. Do not treat transcript_id as a stable key for the same utterance across partials.