This guide covers: Ambient dictation REST and WebSocket APIs.
This guide walks you through building a standalone audio dictation workflow with the Ambient dictation REST and WebSocket APIs. In this workflow, you:
  • Create a dictation session
  • Stream audio over WebSocket
  • Receive transcript updates
  • End the session when dictation is complete
Before you start, register the provider and get an sdp_suki_token. Refer to Provider authentication and Partner authentication for more information.

How the dictation workflow works

The dictation workflow uses both REST APIs and a WebSocket connection.
  1. Authenticate and create a session with the REST API.
  2. Stream audio to the session over WebSocket and process transcript updates.
  3. End the session.

Create a dictation session

Create a parent transcription session before you open the WebSocket connection. The response includes a transcription_session_id, which identifies the session across the workflow. Use the transcription_session_id when you open the WebSocket connection and when you end the session.
Call POST Create dictation session:
curl --request POST \
  --url https://sdp.suki-stage.com/api/v1/transcription/session/create \
  --header 'Content-Type: application/json' \
  --header 'sdp_suki_token: <sdp_suki_token>' \
  --data '{
    "audio_config": {
      "audio_encoding": "LINEAR16",
      "audio_language": "en-US",
      "sample_rate_hertz": 16000
    }
  }'
Example response:
{
  "transcription_session_id": "123dfg-456dfg-789dfg-012dfg"
}
The API returns 201 Created when the session is created successfully.

Request details

  • Include sdp_suki_token in every REST request and during the WebSocket handshake.
  • transcription_session_id is optional in the create request. If you omit it, Suki generates one.
  • audio_config is optional.
  • audio_encoding must be LINEAR16.
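The curl call above can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the function names are hypothetical, the URL is the staging host from the curl example, and error handling is left out.

```python
import json
import urllib.request

# Staging URL copied from the curl example in this guide.
CREATE_URL = "https://sdp.suki-stage.com/api/v1/transcription/session/create"


def build_create_session_request(sdp_suki_token, language="en-US",
                                 sample_rate_hertz=16000):
    """Build (but do not send) the POST request that creates a session."""
    body = {
        "audio_config": {
            "audio_encoding": "LINEAR16",  # the only encoding this guide allows
            "audio_language": language,
            "sample_rate_hertz": sample_rate_hertz,
        }
    }
    return urllib.request.Request(
        CREATE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "sdp_suki_token": sdp_suki_token,
        },
        method="POST",
    )


def create_session(sdp_suki_token):
    """Send the request and return the transcription_session_id."""
    request = build_create_session_request(sdp_suki_token)
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["transcription_session_id"]
```

Separating request construction from sending keeps the payload easy to inspect or unit test without network access.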

Stream audio over WebSocket

After you create the session, connect to GET wss://sdp.suki-stage.com/ws/transcribe when the session is READY or IDLE. If the session is already RUNNING, COMPLETED, or in another state, the WebSocket handshake fails with FailedPrecondition. For push-to-talk workflows that use multiple speech sessions with the same transcription_session_id, refer to Dictation streaming for more information.
Authenticate during the WebSocket handshake instead of sending credentials with each message.

Authentication

For non-browser clients, send these headers during the upgrade request:
  • sdp_suki_token
  • transcription_session_id
For browser clients, use Sec-WebSocket-Protocol:
Sec-WebSocket-Protocol: SukiAmbientAuth,<sdp_suki_token>,<transcription_session_id>
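As an illustration of the two authentication styles, the helpers below build the handshake values for each kind of client. The function names are hypothetical, and the check for commas and whitespace is an added safeguard (the subprotocol list is comma-separated), not a documented Suki rule.

```python
def non_browser_headers(sdp_suki_token, transcription_session_id):
    """Headers to send with the WebSocket upgrade request."""
    return {
        "sdp_suki_token": sdp_suki_token,
        "transcription_session_id": transcription_session_id,
    }


def browser_subprotocols(sdp_suki_token, transcription_session_id):
    """Subprotocol list for browser clients, in the order shown above."""
    for value in (sdp_suki_token, transcription_session_id):
        # A comma or whitespace would split the value into separate entries.
        if "," in value or any(ch.isspace() for ch in value):
            raise ValueError("value is not usable as a subprotocol entry")
    return ["SukiAmbientAuth", sdp_suki_token, transcription_session_id]


def sec_websocket_protocol_value(sdp_suki_token, transcription_session_id):
    """The header value a browser would send for that subprotocol list."""
    return ",".join(browser_subprotocols(sdp_suki_token, transcription_session_id))
```

In a browser, the list returned by browser_subprotocols corresponds to the second argument of the WebSocket constructor.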

Send audio messages

Send audio data as JSON text frames. Example audio message:
{ "type": "AUDIO", "audioData": "<base64-encoded PCM_S16LE bytes>" }
After you finish streaming audio on the connection, send:
{ "type": "EVENT", "event": "AUDIO_END" }

Streaming requirements

  • Send one JSON object per WebSocket send.
  • Do not send raw binary frames.
  • Use audioData for dictation audio payloads.
  • Base64-encode PCM_S16LE audio bytes in audioData.
  • Do not use ambient streaming fields or values such as data or RU9G.
For wire format details, refer to Dictation streaming.
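The message shapes above can be produced with a pair of small helpers. This is a sketch with hypothetical function names; it assumes the input is already raw PCM_S16LE bytes.

```python
import base64
import json


def audio_message(pcm_bytes):
    """Wrap raw PCM_S16LE bytes in the AUDIO text frame shown above."""
    return json.dumps({
        "type": "AUDIO",
        "audioData": base64.b64encode(pcm_bytes).decode("ascii"),
    })


def audio_end_message():
    """The event frame to send after the last audio chunk."""
    return json.dumps({"type": "EVENT", "event": "AUDIO_END"})
```

Each returned string is one JSON object, so it maps to exactly one WebSocket text send.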

Receive transcript events

Parse transcript messages from event.data in your WebSocket onmessage handler. Refer to Stream audio to dictation session for Python and TypeScript code examples, and Dictation streaming for partial and final frames, EOF, and filtering rules. Example partial frame:
{
  "transcript": {
    "transcript": "the recognized text so far",
    "words": []
  },
  "is_final": false,
  "transcript_id": "01J9XABCDEFGHJKMNPQRSTVWXYZ" // example transcript_id
}
Example final frame:
{
  "transcript": {
    "transcript": "The patient reports feeling better today",
    "words": [
      { "word": "The", "speaker": { "id": "S1" } },
      { "word": "patient", "speaker": { "id": "S1" } }
    ]
  },
  "is_final": true,
  "transcript_id": "01J9XWXYZABCDEFGHJKMNPQRSTUV" // example transcript_id
}
  • is_final: false: partial (interim) text that may change in later messages.
  • is_final: true: final text for that segment; the recognizer will not revise it.
  • After the speech stream ends, the server sends { "transcript": { "transcript": "EOF" } }. Treat that as end-of-results for that WebSocket; the connection closes shortly after.
Do not dedupe messages by transcript_id. The server assigns a new ID per frame, including partials.
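One way to apply these rules is to keep finalized segments and the latest partial separately. The class below is a sketch with hypothetical names. It assumes the EOF marker can be recognized by its transcript text alone, and that joining final segments with a space is acceptable for display.

```python
import json


class TranscriptBuffer:
    """Tracks final segments and the current partial for one WebSocket."""

    def __init__(self):
        self.final_segments = []
        self.partial = ""
        self.done = False

    def handle(self, raw_message):
        """Process one inbound text frame (the event.data string)."""
        frame = json.loads(raw_message)
        text = frame.get("transcript", {}).get("transcript", "")
        if text == "EOF":
            # Assumption: no real transcript consists of exactly "EOF".
            self.done = True
            self.partial = ""
        elif frame.get("is_final"):
            # Final text is not revised, so it is safe to append.
            self.final_segments.append(text)
            self.partial = ""
        else:
            # Partials replace each other; transcript_id is not used for dedupe.
            self.partial = text

    def display_text(self):
        """Finalized text followed by the current partial, if any."""
        parts = self.final_segments + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

Call handle for every inbound message and re-render display_text after each call; stop reading once done is true.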

End the dictation session

When dictation is complete:
  1. Send AUDIO_END
  2. Close the WebSocket connection
  3. End the session with the REST API
Call POST End dictation session:
curl --request POST \
  --url https://sdp.suki-stage.com/api/v1/transcription/session/<transcription_session_id>/end \
  --header 'sdp_suki_token: <sdp_suki_token>'
Example response:
{
  "transcription_session_id": "123dfg-456dfg-789dfg-012dfg",
  "status": "completed",
  "final_transcript": "The patient reports feeling better today. Blood pressure is stable at 120/80.",
  "duration": 300,
  "ended_at": "2024-11-26T10:35:00Z"
}
The API returns 200 OK on success. Possible status values include:
  • completed
  • cancelled
  • failed
Use final_transcript as the complete transcript for the session.
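The end call can be sketched the same way as session creation. The function names here are hypothetical, and raising an error for any status other than completed is one possible policy, not documented behavior; this guide does not say whether final_transcript is present for cancelled or failed sessions.

```python
import json
import urllib.parse
import urllib.request

# Staging base URL copied from the curl example in this guide.
BASE_URL = "https://sdp.suki-stage.com/api/v1/transcription/session"


def build_end_session_request(sdp_suki_token, transcription_session_id):
    """Build (but do not send) the POST request that ends a session."""
    session_path = urllib.parse.quote(transcription_session_id, safe="")
    return urllib.request.Request(
        f"{BASE_URL}/{session_path}/end",
        headers={"sdp_suki_token": sdp_suki_token},
        method="POST",
    )


def final_transcript_from(response_body):
    """Return final_transcript, or raise if the session did not complete."""
    payload = json.loads(response_body)
    if payload["status"] != "completed":
        raise RuntimeError(f"session ended with status {payload['status']!r}")
    return payload["final_transcript"]
```

To send the request, pass it to urllib.request.urlopen and feed the response body to final_transcript_from.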

Common integration patterns

Pattern 1: Standard Ambient dictation flow

A typical Ambient dictation workflow follows these steps:
  1. Authenticate: Obtain an sdp_suki_token for the provider.
  2. Create the session: Call Create dictation session and save the transcription_session_id.
  3. Stream audio: Open a WebSocket connection to /ws/transcribe when the session is READY or IDLE. Send AUDIO messages, handle partial and final inbound frames, and finish with AUDIO_END.
  4. Process transcripts: Update the UI with partial and final transcript updates from incoming WebSocket messages.
  5. End the session: Call End dictation session and store the final transcript output.

Pattern 2: Push-to-talk or reconnect

For push-to-talk, handle each utterance on its own WebSocket connection: open /ws/transcribe when the session is READY or IDLE, stream one utterance, send AUDIO_END, wait for EOF, then close the WebSocket. When the session is READY or IDLE again, open a new WebSocket with the same transcription_session_id for the next utterance. If a connection drops mid-utterance, reconnect with the same transcription_session_id and sdp_suki_token, but only once the session accepts a new speech session. End the parent session with REST only when the full dictation workflow is complete.
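The per-utterance sequence can be sketched independently of any WebSocket library. In the sketch below, send and recv are hypothetical callables that wrap an already-open connection, and EOF is detected by transcript text alone, which is an assumption. For brevity it reads responses only after all audio is sent; a live client would normally read while it streams.

```python
import base64
import json


def stream_utterance(send, recv, pcm_chunks):
    """Send one utterance, wait for EOF, and return its final segments.

    send(text) writes one text frame; recv() returns the next inbound
    text frame. Both are placeholders for a real WebSocket client.
    """
    for chunk in pcm_chunks:
        send(json.dumps({
            "type": "AUDIO",
            "audioData": base64.b64encode(chunk).decode("ascii"),
        }))
    send(json.dumps({"type": "EVENT", "event": "AUDIO_END"}))

    finals = []
    while True:
        frame = json.loads(recv())
        text = frame.get("transcript", {}).get("transcript", "")
        if text == "EOF":
            # End of results for this connection; the caller closes it next.
            return finals
        if frame.get("is_final"):
            finals.append(text)
```

A push-to-talk client would call this once per button press, each time on a fresh connection that reuses the same transcription_session_id.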

Pattern 3: Configure audio settings

Pass audio_config in the create request when you need to specify:
  • Audio encoding
  • Language
  • Sample rate
Ensure that the streamed audio format matches the configuration you provide.
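If your audio comes from WAV files, a quick pre-flight check can catch a mismatch before you stream. This sketch uses Python's standard wave module; the function name is hypothetical, and it checks only sample width and sample rate against the audio_config fields shown in this guide.

```python
import wave


def wav_config_problems(wav_file, audio_config):
    """Return a list of mismatches between a WAV file and audio_config.

    wav_file may be a path or a binary file-like object.
    """
    problems = []
    with wave.open(wav_file, "rb") as wav:
        # LINEAR16 means 16-bit samples, that is, 2 bytes per sample.
        if audio_config.get("audio_encoding") == "LINEAR16" and wav.getsampwidth() != 2:
            problems.append(f"sample width is {wav.getsampwidth()} bytes, expected 2")
        expected_rate = audio_config.get("sample_rate_hertz")
        if expected_rate is not None and wav.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {expected_rate} Hz"
            )
    return problems
```

An empty list means the file agrees with the configuration on the fields that were checked.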

Pattern 4: Stream audio in chunks

For lower latency, send smaller AUDIO messages as audio becomes available instead of buffering the entire recording. After the final audio chunk, send AUDIO_END.
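As a rough sketch of chunking, the generator below slices a PCM buffer into fixed-duration pieces. The 100 ms default and the mono assumption are illustrative choices, not values from this guide. At 16000 Hz with 16-bit mono samples, 100 ms is 1600 samples, or 3200 bytes.

```python
def chunk_pcm(pcm_bytes, sample_rate_hertz=16000, chunk_ms=100, channels=1):
    """Yield successive chunks of 16-bit PCM, each about chunk_ms long."""
    bytes_per_frame = 2 * channels  # LINEAR16: 2 bytes per sample per channel
    frames_per_chunk = sample_rate_hertz * chunk_ms // 1000
    if frames_per_chunk <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    # Chunk size is a whole number of frames, so samples are never split.
    chunk_size = frames_per_chunk * bytes_per_frame
    for start in range(0, len(pcm_bytes), chunk_size):
        yield pcm_bytes[start:start + chunk_size]
```

The last chunk may be shorter than the others. Each chunk would then be base64-encoded into its own AUDIO message.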

What you can build

Live Dictation Experiences

Display partial and final transcript updates in real time while the provider speaks.

Server-side Transcription Pipelines

Capture audio on a backend service, stream it through /ws/transcribe, and store the final transcript output.

Telehealth and In-person Workflows

Add transcription workflows to virtual or in-person clinical experiences without ambient note generation.

EHR and Form Integrations

Send transcript output into forms, notes, or downstream clinical systems after the session ends.

Related pages:
  • Create Dictation Session: Create a transcription session
  • Stream Audio for Dictation: Connect and stream audio over WebSocket
  • End Dictation Session: End a transcription session

Best practices

  • Stream audio in small chunks to reduce latency.
  • Reuse the same transcription_session_id when reconnecting WebSocket sessions.
  • End sessions after dictation completes to release resources.
  • Match streamed audio settings to your configured sample rate and encoding.
  • Validate audio quality before production deployment.

FAQs

Feature | Description
Dictation | Converts speech into transcript text with formatting such as punctuation and capitalization.
Ambient clinical documentation | Transcribes conversations and generates structured clinical documentation outputs.

Dictation uses /ws/transcribe with audioData payloads and AUDIO_END events. Ambient clinical documentation uses /ws/stream with different authentication flows and message formats. For more information, refer to Audio streaming and Dictation streaming.
Last modified on May 22, 2026