Use this guide when you already have a Dictation session and need to stream live audio to GET /ws/transcribe for real-time transcription. The WebSocket carries only the live audio stream and the real-time transcript frames that the server sends back. You still use the Dictation REST APIs to create the session before streaming, end the session after streaming, and retrieve final or cumulative results. For endpoint details and code examples, refer to Stream audio to dictation session. For the full REST workflow, refer to Audio dictation.

How Dictation streaming works

Suki for Partners dictation streaming works as follows:
  1. Create or reuse a Dictation session.
  2. Open a WebSocket connection to GET /ws/transcribe.
  3. Send one JSON message per audio chunk.
  4. Send an explicit end-of-audio message when the user stops speaking.
  5. Read partial and final transcript messages from the socket.
  6. Close the socket, then use End dictation session to end the Dictation session and retrieve results.
Dictation streaming uses JSON text frames. Do not send raw binary audio frames to this endpoint.

Before you connect (Prerequisites)

Open GET /ws/transcribe only when the dictation session is READY or IDLE. If the session is RUNNING because another audio stream is active, COMPLETED, or in another state that cannot accept speech, the WebSocket handshake fails. The server returns FailedPrecondition with a message such as transcript session is not accepting new speech sessions.
One dictation session can support multiple speech sessions over time, such as push-to-talk. After you send AUDIO_END and the server finishes processing that stream, wait until the session returns to READY or IDLE before opening another WebSocket for the next utterance.
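
The state check above can be sketched as a small gate in front of the WebSocket handshake. This is a minimal sketch, not the official client: get_state is a hypothetical callable that wraps whatever call your integration uses to read the session state, and the timeout and poll interval are illustrative values only.

```python
import time
from typing import Callable

# States in which this guide says /ws/transcribe accepts a new speech session.
ACCEPTING_STATES = {"READY", "IDLE"}


def can_open_stream(session_state: str) -> bool:
    """Return True when the session state allows a new WebSocket handshake."""
    return session_state in ACCEPTING_STATES


def wait_until_accepting(
    get_state: Callable[[], str],  # hypothetical: supplied by your integration
    timeout_s: float = 10.0,       # illustrative default, not a documented value
    poll_interval_s: float = 0.25,  # illustrative default, not a documented value
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_state until the session is READY or IDLE, or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if can_open_stream(get_state()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

For push-to-talk, a client could call wait_until_accepting before each new utterance and open the socket only when it returns True.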

Send JSON text frames

Every message you send on /ws/transcribe must be a UTF-8 JSON text frame.
  • Each WebSocket frame must contain exactly one JSON object.
  • Each client send should contain one logical message.
  • Audio bytes go inside a JSON string field, not in a binary WebSocket frame.
Do not:
  • Send binary WebSocket frames for audio data on this JSON-based protocol.
  • Send more than one JSON object in a single frame.
  • Use HTTP endpoints to stream raw audio.
If the server receives a non-JSON payload where JSON is expected, parsing can fail with errors such as invalid character or null byte errors.
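
One way to guard against malformed frames on the client side is to serialize and check every outbound frame before it is sent. The helper names below are hypothetical and only illustrate the one-object-per-frame rule; they are not part of any Suki SDK.

```python
import json


def encode_frame(message: dict) -> str:
    """Serialize exactly one JSON object for a single WebSocket text frame."""
    if not isinstance(message, dict):
        raise TypeError("each frame must carry one JSON object")
    # json.dumps emits ASCII by default, which is also valid UTF-8.
    return json.dumps(message)


def is_single_json_object(frame: str) -> bool:
    """Return True if the frame text parses as exactly one JSON object."""
    try:
        return isinstance(json.loads(frame), dict)
    except json.JSONDecodeError:
        # Raised for non-JSON text and for two objects concatenated in one frame.
        return False
```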

Message format

Each outbound message is a JSON object with a type field. For audio chunks, the payload field is audioData.

Audio chunks

Send audio with type set to AUDIO:
{ "type": "AUDIO", "audioData": "<base64-encoded PCM_S16LE audio>" }
The audioData value must be:
  • Standard Base64 (RFC 4648)
  • An encoding of the raw PCM_S16LE bytes you intend to send
  • Sent as a JSON string, regardless of the programming language you use
Do not use:
  • Hex encoding
  • URL-safe Base64
  • Raw binary inside JSON strings
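
As an illustration of the encoding rules above, the following sketch builds one AUDIO message with Python's standard library. The function name build_audio_message is an assumption made for this example. base64.b64encode uses the standard RFC 4648 alphabet with + and /, while base64.urlsafe_b64encode substitutes - and _, which is the variant to avoid here.

```python
import base64
import json


def build_audio_message(pcm_chunk: bytes) -> str:
    """Wrap raw PCM_S16LE bytes in an AUDIO message as a JSON text frame."""
    # Standard Base64 (RFC 4648); the result is ASCII, so it is a safe JSON string.
    audio_data = base64.b64encode(pcm_chunk).decode("ascii")
    return json.dumps({"type": "AUDIO", "audioData": audio_data})


# These two bytes encode differently in the standard and URL-safe alphabets.
sample = b"\xfb\xff"
standard = base64.b64encode(sample).decode("ascii")          # "+/8="
url_safe = base64.urlsafe_b64encode(sample).decode("ascii")  # "-_8="
```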

End-of-audio message

When you finish sending audio on the WebSocket, send one EVENT message with event set to AUDIO_END:
{ "type": "EVENT", "event": "AUDIO_END" }
This message tells the server that no more audio chunks are coming for that stream.
Do not:
  • End dictation audio with { "type": "AUDIO", "data": "RU9G" }. That is the ambient /ws/stream pattern.
  • Use custom end markers instead of AUDIO_END or the inbound EOF. See Read transcript frames.
  • Use binary signaling in place of the JSON AUDIO_END message.
Dictation /ws/transcribe does not use START_TIME, does not use ambient-style AUDIO messages with a data field, and does not use the ambient end marker RU9G. Use audioData for chunks and EVENT with event: AUDIO_END when you are done sending audio.

Required message order

For each logical stream of audio on the socket:
  1. Send one or more AUDIO messages. Each message includes one audioData chunk.
  2. Send one EVENT message with event: AUDIO_END after the last audio chunk you intend to send on that connection.
There is no START_TIME step and no ambient RU9G end marker on this endpoint.
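
The ordering above can be sketched as a generator that yields every frame for one stream in the required sequence. The name outbound_frames, the decision to skip empty chunks, and the decision to raise when no audio was produced are all choices made for this example, not behavior documented for the endpoint.

```python
import base64
import json
from typing import Iterable, Iterator

AUDIO_END_FRAME = json.dumps({"type": "EVENT", "event": "AUDIO_END"})


def outbound_frames(pcm_chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield one AUDIO frame per chunk, then exactly one AUDIO_END frame."""
    sent_audio = False
    for chunk in pcm_chunks:
        if not chunk:
            continue  # assumption: empty chunks carry no audio, so skip them
        sent_audio = True
        audio_data = base64.b64encode(chunk).decode("ascii")
        yield json.dumps({"type": "AUDIO", "audioData": audio_data})
    if not sent_audio:
        # The guide calls for one or more AUDIO messages before AUDIO_END.
        raise ValueError("no audio chunks to send")
    yield AUDIO_END_FRAME
```

A caller would pass each yielded string to the text-send method of its WebSocket client, one frame per send.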

Read transcript frames

For each upstream ASR result, the gateway sends a JSON text frame on the WebSocket. Parse event.data in onmessage, or use the equivalent message handler in your WebSocket stack. For code samples, see Stream audio to dictation session.

Partial and final transcripts

Use is_final to decide whether the text should be shown as draft text or committed to the transcript.
is_final | Meaning
false | Partial, or interim, transcript. The text may change in later messages as the recognizer refines its result.
true | Final transcript for that segment. The recognizer has committed to this text, and it will not be revised.
Example partial frame:
{
  "transcript": {
    "transcript": "the recognized text so far",
    "words": []
  },
  "is_final": false,
  "transcript_id": "01J9XABCDEFGHJKMNPQRSTVWXYZ"
}
Example final frame:
{
  "transcript": {
    "transcript": "The patient reports feeling better today",
    "words": [
      { "word": "The", "speaker": { "id": "S1" } },
      { "word": "patient", "speaker": { "id": "S1" } }
    ]
  },
  "is_final": true,
  "transcript_id": "01J9XWXYZABCDEFGHJKMNPQRSTUV"
}

End of results

After the upstream stream ends, the server sends one terminal frame:
{
  "transcript": {
    "transcript": "EOF"
  }
}
Treat transcript.transcript equal to EOF as the canonical end-of-results signal for that speech session. The WebSocket closes shortly after.
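
A read loop that stops at the terminal frame might look like the following sketch. It takes any iterable of raw JSON text frames, so it does not depend on a particular WebSocket library. The function names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def is_eof_frame(frame: dict) -> bool:
    """Return True for the terminal frame, where transcript.transcript is "EOF"."""
    transcript = frame.get("transcript")
    return isinstance(transcript, dict) and transcript.get("transcript") == "EOF"


def results_until_eof(raw_frames: Iterable[str]) -> Iterator[dict]:
    """Parse inbound text frames and stop at the end-of-results signal."""
    for raw in raw_frames:
        frame = json.loads(raw)
        if is_eof_frame(frame):
            return  # the EOF frame itself is not yielded
        yield frame
```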

Handle transcript frames in your UI

  • Empty transcripts are dropped server-side. You only receive frames where transcript.transcript is non-empty, except the terminal EOF frame.
  • words is most useful on final frames. Partial frames typically include an empty words array.
  • transcript_id is generated for each emission, including partial frames. Two messages for the same spoken utterance can have different IDs. Do not dedupe by transcript_id. Use it to order messages and to distinguish separate final segments from each other.
  • Append or replace UI text based on is_final. Treat partials as draft text that may change, and commit finals to your transcript buffer.
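
The guidance above can be sketched as a small buffer that keeps committed finals separate from the current draft. TranscriptView is a hypothetical class for illustration. Joining segments with a single space and clearing the draft when a final arrives are assumptions about how a UI might want to render the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptView:
    committed: List[str] = field(default_factory=list)  # final segments, in arrival order
    draft: str = ""                                      # latest partial text, may change

    def apply(self, frame: dict) -> None:
        """Update the view from one parsed transcript frame."""
        text = frame["transcript"]["transcript"]
        if text == "EOF":
            return  # terminal frame carries no transcript text
        if frame.get("is_final"):
            self.committed.append(text)
            self.draft = ""  # assumption: a final supersedes the current draft
        else:
            self.draft = text  # replace, never append, partial text

    def render(self) -> str:
        """Return committed text followed by the current draft, if any."""
        parts = self.committed + ([self.draft] if self.draft else [])
        return " ".join(parts)
```

Note that the view never keys on transcript_id, because partials for the same utterance can arrive with different IDs.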

Complete the session after streaming

After you send AUDIO_END and finish reading results from the WebSocket, close the connection and complete the session with REST.
  1. Close the WebSocket connection. Close the connection from the client when you are done sending audio on that connection and have finished reading results.
  2. End the Dictation session using REST. End the transcription session using End dictation session.
  3. Retrieve results using REST. Follow Audio dictation to retrieve final or cumulative transcripts and clean up the session. Use the same transcription_session_id and sdp_suki_token patterns as the rest of the Dictation APIs.
End the session and retrieve transcripts using the REST flows linked in the steps above. The outbound WebSocket contract in this guide does not replace those APIs.

Audio format and chunking

Each audioData value must decode from Base64 to raw PCM_S16LE audio bytes.

PCM vs WAV

PCM_S16LE is raw audio data. .wav is a container format and usually includes a header before the audio data. If your source is WAV, skip the header before chunking (44 bytes in a canonical PCM WAV file; files with extra metadata chunks have longer headers), or decode the file to raw PCM before sending. Use 0 as the offset if your buffer is already raw PCM. Sending WAV headers as PCM reduces recognition quality and makes debugging harder.
  • Encoding: PCM_S16LE, PCM signed 16-bit little-endian, same family as LINEAR16 at 16 kHz mono in typical capture pipelines
  • Channels: Mono
  • Sample rate: 16 kHz, aligned with what your integration expects
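
If the source is a WAV file, decoding it is usually safer than skipping a fixed number of bytes, because the header length can vary. The sketch below uses Python's standard wave module to extract the raw frames and to check them against the format listed above. The function name wav_to_pcm_s16le is an assumption for this example, and rejecting mismatched audio instead of resampling is a design choice.

```python
import io
import wave


def wav_to_pcm_s16le(wav_bytes: bytes) -> bytes:
    """Return the raw PCM frames of a 16 kHz, mono, 16-bit WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        if wav.getframerate() != 16000:
            raise ValueError("expected a 16 kHz sample rate")
        # readframes returns only the sample data, without any header bytes.
        return wav.readframes(wav.getnframes())
```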

Chunk size

Send audio in small chunks during streaming. For 16 kHz, mono, 16-bit audio, about 3200 bytes per chunk is a common choice, which is about 100 ms per message. Size chunks to your capture pipeline if your encoder differs.
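
The chunk size follows from the audio format: 16,000 samples per second × 1 channel × 2 bytes per sample = 32,000 bytes per second, so 100 ms is 3,200 bytes. The sketch below works through that arithmetic and splits a PCM buffer accordingly. The function names are hypothetical, and the 100 ms default mirrors the common choice mentioned above.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000
CHANNELS = 1
BYTES_PER_SAMPLE = 2  # 16-bit samples


def chunk_size_bytes(chunk_ms: int = 100) -> int:
    """Return the number of PCM bytes that cover chunk_ms of audio."""
    return SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE * chunk_ms // 1000


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Split a PCM buffer into chunks of at most size bytes."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

Keeping the chunk size an even number of bytes means a 16-bit sample is never split across two messages.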

Encode each chunk

For every AUDIO message:
  1. Take raw PCM_S16LE bytes.
  2. Encode the bytes using standard Base64 (RFC 4648).
  3. Send the encoded string as audioData.

Example flow

This example shows the outbound message sequence for one stream: multiple AUDIO chunks, followed by AUDIO_END.
{ "type": "AUDIO", "audioData": "<base64(pcm_s16le chunk 1)>" }
{ "type": "AUDIO", "audioData": "<base64(pcm_s16le chunk 2)>" }

{ "type": "EVENT", "event": "AUDIO_END" }

Troubleshooting

Symptom: Poor transcription or failures
Fix: Strip the WAV header, for example 44 bytes, or decode to raw PCM_S16LE before Base64 encoding.

Symptom: Server parse errors
Fix: Use standard Base64 (RFC 4648), not URL-safe Base64 or hex.

Symptom: Parsing errors
Fix: Send one JSON object per WebSocket frame.

Symptom: Stream does not finalize, or the server does not know audio is complete
Fix: Send { "type": "EVENT", "event": "AUDIO_END" } after your last AUDIO message.

Symptom: Ignored or invalid audio payloads
Fix: Use audioData for Base64 PCM_S16LE chunks on /ws/transcribe.

Symptom: 401 on handshake or immediate disconnect
Fix: Use SukiAmbientAuth,<sdp_suki_token>,<transcription_session_id>, not SukiTranscriptionAuth, and not the ambient session ID, token order.

Symptom: Missing final transcript or session left open
Fix: Always call End dictation session to ensure the session is closed and the transcript is generated.

Symptom: Handshake fails with transcript session is not accepting new speech sessions
Fix: Open /ws/transcribe only when the dictation session is READY or IDLE. Before you connect again, wait until the previous speech session on that transcription_session_id has finished: you sent AUDIO_END, received EOF, and the session returned to READY or IDLE.

Symptom: Partial text jumps unpredictably, or words is always empty
Fix: Update the UI from is_final: false frames as draft text. Use words on is_final: true frames for word-level or speaker-aware display. Do not treat transcript_id as a stable key for the same utterance across partials.
Last modified on May 22, 2026