Dictation Basic Usage

This guide covers: Ambient dictation REST and WebSocket APIs.

For a browser app, use the Web SDK for audio dictation or the Dictation SDK Beta instead of building this REST and WebSocket flow yourself.
For wire format, handshake rules, and troubleshooting, refer to the Dictation streaming guide.

This guide walks you through how to build a standalone audio dictation workflow with the Ambient dictation REST APIs. In this workflow, you:

Create a dictation session
Stream audio over WebSocket
Receive transcript updates
End the session when dictation is complete

Before you start, register the provider and get an sdp_suki_token. Refer to Provider authentication and Partner authentication for more information.

How the dictation workflow works

The dictation workflow uses both REST APIs and a WebSocket connection.

Authenticate and create a session with the REST API.
Stream audio to the session over WebSocket and process transcript updates.
End the session.

Create a dictation session

Create a parent transcription session before you open the WebSocket connection. The response includes a transcription_session_id, which identifies the session across the workflow. Use the transcription_session_id when you:

Connect to /ws/transcribe
End the session with the End dictation session API

Call POST Create dictation session:

curl --request POST \
  --url https://sdp.suki-stage.com/api/v1/transcription/session/create \
  --header 'Content-Type: application/json' \
  --header 'sdp_suki_token: <sdp_suki_token>' \
  --data '{
    "audio_config": {
      "audio_encoding": "LINEAR16",
      "audio_language": "en-US",
      "sample_rate_hertz": 16000
    }
  }'

Example response:

{
  "transcription_session_id": "123dfg-456dfg-789dfg-012dfg"
}

The API returns 201 Created when the session is created successfully.

Request details

Include sdp_suki_token in every REST request and during the WebSocket handshake.
transcription_session_id is optional in the create request. If you omit it, Suki generates one.
audio_config is optional.
audio_encoding must be LINEAR16.
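
The same create call can be sketched in Python. This is a minimal sketch, not an official client: the helper names build_create_session_request and parse_create_session_response are hypothetical, and the host is the staging host from the curl example. The code only builds the request and parses a response body; sending is left to your HTTP client.

```python
import json
import urllib.request

# Staging host taken from the curl example above.
BASE_URL = "https://sdp.suki-stage.com"


def build_create_session_request(sdp_suki_token, audio_config=None):
    """Build (but do not send) the POST request that creates a session."""
    body = {}
    if audio_config is not None:
        # The guide states that audio_encoding must be LINEAR16.
        if audio_config.get("audio_encoding", "LINEAR16") != "LINEAR16":
            raise ValueError("audio_encoding must be LINEAR16")
        body["audio_config"] = audio_config
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/transcription/session/create",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "sdp_suki_token": sdp_suki_token,
        },
        method="POST",
    )


def parse_create_session_response(raw_body):
    """Return the transcription_session_id from the JSON response body."""
    return json.loads(raw_body)["transcription_session_id"]
```

To send the request, you could pass it to urllib.request.urlopen and hand the response body to parse_create_session_response. A successful call returns 201 Created.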

Stream audio over WebSocket

After you create the session, connect to GET wss://sdp.suki-stage.com/ws/transcribe when the session is READY or IDLE. If the session is already RUNNING, COMPLETED, or in another state, the WebSocket handshake fails with FailedPrecondition. For push-to-talk workflows that use multiple speech sessions with the same transcription_session_id, refer to Dictation streaming for more information.

Authenticate during the WebSocket handshake instead of sending credentials with each message.

Authentication

For non-browser clients, send these headers during the upgrade request:

sdp_suki_token
transcription_session_id

For browser clients, use Sec-WebSocket-Protocol:

Sec-WebSocket-Protocol: SukiAmbientAuth,<sdp_suki_token>,<transcription_session_id>
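
Both handshake options can be sketched as small Python helpers. The function names are hypothetical and the code only builds the values described above; how you attach them to the upgrade request depends on the WebSocket client library you use.

```python
def upgrade_headers(sdp_suki_token: str, transcription_session_id: str) -> dict:
    """Headers a non-browser client sends with the WebSocket upgrade request."""
    return {
        "sdp_suki_token": sdp_suki_token,
        "transcription_session_id": transcription_session_id,
    }


def browser_subprotocol_header(sdp_suki_token: str, transcription_session_id: str) -> str:
    """Sec-WebSocket-Protocol value for browser clients, in the order shown above."""
    return ",".join(["SukiAmbientAuth", sdp_suki_token, transcription_session_id])
```

Browsers cannot set arbitrary headers on a WebSocket upgrade, which is why the browser path carries the credentials in the subprotocol list instead.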

Send audio messages

Send audio data as JSON text frames. Example audio message:

{ "type": "AUDIO", "audioData": "<base64-encoded PCM_S16LE bytes>" }

After you finish streaming audio on the connection, send:

{ "type": "EVENT", "event": "AUDIO_END" }

Streaming requirements

Send one JSON object per WebSocket send.
Do not send raw binary frames.
Use audioData for dictation audio payloads.
Base64-encode PCM_S16LE audio bytes in audioData.
Do not use ambient streaming fields such as data or RU9G.

For wire format details, refer to Dictation streaming.
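
The message shapes above can be produced with the Python standard library. This is a minimal sketch with hypothetical helper names; it assumes your samples are signed 16-bit integers that you need to pack as little-endian PCM before encoding.

```python
import base64
import json
import struct


def pcm_s16le(samples):
    """Pack signed 16-bit integer samples as little-endian PCM bytes."""
    return struct.pack(f"<{len(samples)}h", *samples)


def audio_message(pcm_bytes: bytes) -> str:
    """One JSON text frame carrying base64-encoded PCM_S16LE audio."""
    encoded = base64.b64encode(pcm_bytes).decode("ascii")
    return json.dumps({"type": "AUDIO", "audioData": encoded})


def audio_end_message() -> str:
    """The event frame to send after the last audio chunk."""
    return json.dumps({"type": "EVENT", "event": "AUDIO_END"})
```

Send each returned string as its own text frame, one JSON object per WebSocket send.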

Receive transcript events

Parse transcript messages from event.data in your WebSocket onmessage handler. Refer to Stream audio to dictation session for Python and TypeScript code examples, and Dictation streaming for partial and final frames, EOF, and filtering rules. Example partial frame:

{
  "transcript": {
    "transcript": "the recognized text so far",
    "words": []
  },
  "is_final": false,
  "transcript_id": "01J9XABCDEFGHJKMNPQRSTVWXYZ"
}

Example final frame:

{
  "transcript": {
    "transcript": "The patient reports feeling better today",
    "words": [
      { "word": "The", "speaker": { "id": "S1" } },
      { "word": "patient", "speaker": { "id": "S1" } }
    ]
  },
  "is_final": true,
  "transcript_id": "01J9XWXYZABCDEFGHJKMNPQRSTUV"
}

is_final: false: partial (interim) text that may change in later messages.
is_final: true: final text for that segment; the recognizer will not revise it.
After the speech stream ends, the server sends { "transcript": { "transcript": "EOF" } }. Treat that as end-of-results for that WebSocket; the connection closes shortly after.

Do not dedupe messages by transcript_id. The server assigns a new ID per frame, including partials.
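
One way to apply these rules is a small buffer that keeps final segments and replaces the current partial on every message. This is a sketch under stated assumptions: the TranscriptBuffer class is hypothetical, it matches the EOF sentinel by exact text, and it joins segments with a single space. The filtering rules in Dictation streaming take precedence over this logic.

```python
import json


class TranscriptBuffer:
    """Accumulates final segments and tracks the latest partial text."""

    def __init__(self):
        self.final_segments = []
        self.partial = ""
        self.done = False

    def handle(self, raw_message: str) -> None:
        """Process one inbound WebSocket text frame."""
        frame = json.loads(raw_message)
        text = frame.get("transcript", {}).get("transcript", "")
        if text == "EOF":
            # End-of-results sentinel; the connection closes shortly after.
            self.done = True
        elif frame.get("is_final"):
            # Final text is not revised, so keep it and drop the partial.
            self.final_segments.append(text)
            self.partial = ""
        else:
            # Partials may change, so each one replaces the previous partial.
            self.partial = text

    def display_text(self) -> str:
        """Text to show in the UI: all finals plus the current partial."""
        parts = self.final_segments + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

The buffer never looks at transcript_id, because every frame carries a new one.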

End the dictation session

When dictation is complete:

Send AUDIO_END
Close the WebSocket connection
End the session with the REST API

Call POST End dictation session:

curl --request POST \
  --url https://sdp.suki-stage.com/api/v1/transcription/session/<transcription_session_id>/end \
  --header 'sdp_suki_token: <sdp_suki_token>'

Example response:

{
  "transcription_session_id": "123dfg-456dfg-789dfg-012dfg",
  "status": "completed",
  "final_transcript": "The patient reports feeling better today. Blood pressure is stable at 120/80.",
  "duration": 300,
  "ended_at": "2024-11-26T10:35:00Z"
}

The API returns 200 OK on success. Possible status values include:

completed
cancelled
failed

Use final_transcript as the complete transcript for the session.
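
The end call and its response handling can be sketched in Python as follows. The helper names are hypothetical, the host is the staging host from the curl example, and raising an error for any status other than completed is a design choice of this sketch, not documented API behavior.

```python
import json
import urllib.parse
import urllib.request

# Staging host taken from the curl example above.
BASE_URL = "https://sdp.suki-stage.com"


def build_end_session_request(sdp_suki_token, transcription_session_id):
    """Build (but do not send) the POST request that ends a session."""
    # Percent-encode the ID so it is safe to place in the URL path.
    path_id = urllib.parse.quote(transcription_session_id, safe="")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/transcription/session/{path_id}/end",
        headers={"sdp_suki_token": sdp_suki_token},
        method="POST",
    )


def final_transcript_or_raise(raw_body):
    """Return final_transcript when the session status is completed."""
    result = json.loads(raw_body)
    status = result.get("status")
    if status != "completed":
        raise RuntimeError(f"session ended with status {status!r}")
    return result["final_transcript"]
```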

Common integration patterns

Pattern 1: Standard Ambient dictation flow

A typical Ambient dictation workflow follows these steps:

Authenticate

Obtain an sdp_suki_token for the provider.

Create the session

Call Create dictation session and save the transcription_session_id.

Stream audio

Open a WebSocket connection to /ws/transcribe when the session is READY or IDLE. Send AUDIO messages, handle partial and final inbound frames, and finish with AUDIO_END.

Process transcripts

Update the UI with partial and final transcript updates from incoming WebSocket messages.

End the session

Call End dictation session and store the final transcript output.

Pattern 2: Push-to-talk or reconnect

For push-to-talk, open /ws/transcribe when the session is READY or IDLE, stream one utterance, send AUDIO_END, wait for EOF, then close the WebSocket. When the session is READY or IDLE again, open a new WebSocket on the same transcription_session_id for the next utterance. If a connection drops mid-utterance, open a new WebSocket with the same transcription_session_id and sdp_suki_token only when the session accepts a new speech session. End the parent session with REST only when the full dictation workflow is complete.
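
The ordering rules for push-to-talk can be captured in a little client-side bookkeeping. This is a hypothetical sketch: this guide does not describe how your client learns the session state, so session_state is an assumed input, and the UtteranceLifecycle class is illustrative only.

```python
# States in which the guide says a WebSocket can be opened.
CONNECTABLE_STATES = {"READY", "IDLE"}


def can_open_websocket(session_state: str) -> bool:
    """True when a new speech session may be started on the parent session."""
    return session_state in CONNECTABLE_STATES


class UtteranceLifecycle:
    """Tracks one utterance (one WebSocket) inside a parent session."""

    def __init__(self):
        self.audio_end_sent = False
        self.eof_received = False

    def mark_audio_end_sent(self) -> None:
        self.audio_end_sent = True

    def mark_eof_received(self) -> None:
        self.eof_received = True

    def safe_to_close(self) -> bool:
        """Close only after AUDIO_END was sent and EOF came back."""
        return self.audio_end_sent and self.eof_received
```

Create one UtteranceLifecycle per WebSocket and keep the same transcription_session_id across all of them.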

Pattern 3: Configure audio settings

Pass audio_config in the create request when you need to specify:

Audio encoding
Language
Sample rate

Ensure that the streamed audio format matches the configuration you provide.
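
If your audio comes from WAV files, a check like the following can catch a mismatch before you stream. This is a minimal sketch using Python's wave module; the function names are hypothetical, and it checks only sample width and sample rate because this guide does not state a channel count.

```python
import io
import wave


def audio_config_mismatches(wav_source, audio_config):
    """Compare a WAV file (path or file object) with a create-request audio_config."""
    problems = []
    with wave.open(wav_source, "rb") as wav:
        # LINEAR16 is 16-bit linear PCM, which is 2 bytes per sample.
        if wav.getsampwidth() != 2:
            problems.append(f"sample width is {wav.getsampwidth()} bytes, expected 2")
        expected_rate = audio_config["sample_rate_hertz"]
        if wav.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {expected_rate} Hz"
            )
    return problems


def make_silent_wav(sample_rate_hertz, frames=160):
    """Demo fixture: an in-memory mono 16-bit WAV containing silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate_hertz)
        wav.writeframes(b"\x00\x00" * frames)
    buffer.seek(0)
    return buffer
```

An empty list means the file matches the configured sample rate and encoding width.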

Pattern 4: Stream audio in chunks

For lower latency, send smaller AUDIO messages as audio becomes available instead of buffering the entire recording. After the final audio chunk, send AUDIO_END.
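
Chunked streaming can be sketched as a generator that yields one AUDIO message per chunk and then AUDIO_END. The function names, the 100 ms default chunk duration, and the mono assumption are illustrative choices, not values from the API reference.

```python
import base64
import json
from typing import Iterator


def chunk_size_bytes(sample_rate_hertz: int = 16000, chunk_ms: int = 100,
                     bytes_per_sample: int = 2) -> int:
    """Bytes of mono 16-bit PCM that cover chunk_ms milliseconds."""
    return sample_rate_hertz * chunk_ms // 1000 * bytes_per_sample


def iter_dictation_messages(pcm: bytes, size: int) -> Iterator[str]:
    """Yield one AUDIO text frame per chunk, then the AUDIO_END event."""
    for start in range(0, len(pcm), size):
        chunk = pcm[start:start + size]
        encoded = base64.b64encode(chunk).decode("ascii")
        yield json.dumps({"type": "AUDIO", "audioData": encoded})
    yield json.dumps({"type": "EVENT", "event": "AUDIO_END"})
```

With the defaults, each chunk is 3200 bytes: 1600 samples at 2 bytes per sample. In a live capture you would encode each chunk as it arrives from the microphone instead of slicing a complete recording.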

What you can build

Live Dictation Experiences

Display partial and final transcript updates in real time while the provider speaks.

Server-side Transcription Pipelines

Capture audio on a backend service, stream it through /ws/transcribe, and store the final transcript output.

Telehealth and In-person Workflows

Add transcription workflows to virtual or in-person clinical experiences without ambient note generation.

EHR and Form Integrations

Send transcript output into forms, notes, or downstream clinical systems after the session ends.

Related API references

Create Dictation Session

Create a transcription session

Stream Audio for Dictation

Connect and stream audio over WebSocket

End Dictation Session

End a transcription session

Best practices

Stream audio in small chunks to reduce latency.
Reuse the same transcription_session_id when reconnecting WebSocket sessions.
End sessions after dictation completes to release resources.
Match streamed audio settings to your configured sample rate and encoding.
Validate audio quality before production deployment.

FAQs

What's the difference between dictation and ambient clinical documentation?

Feature	Description
Dictation	Converts speech into transcript text with formatting such as punctuation and capitalization.
Ambient clinical documentation	Transcribes conversations and generates structured clinical documentation outputs.

How is /ws/transcribe different from ambient /ws/stream?

Dictation uses /ws/transcribe with audioData payloads and AUDIO_END events. Ambient clinical documentation uses /ws/stream with different authentication flows and message formats. For more information, refer to Audio streaming and Dictation streaming.
