Quick Summary
Audio Dictation converts spoken conversations into text in real time. You can use this feature to enable transcription of patient-provider conversations.
Audio Dictation is supported for APIs only. SDKs include dictation as part of ambient note generation, but the standalone Audio Dictation feature (speech-to-text only) is available via APIs only.

Overview

Audio Dictation converts spoken conversations into text in real time. Use the Dictation feature to transcribe patient-provider conversations as they happen, which lets providers focus on the conversation instead of on transcribing it.

Key Features

Real-time Transcription

See text appear as people speak; no waiting for the conversation to end.

Multiple Sessions

Run multiple dictation sessions under one parent session for complex workflows.

WebSocket Streaming

Low-latency audio streaming for fast, responsive dictation.

Clean Transcripts

Automatically formatted with proper punctuation, capitalization, and filler words removed.

Intermediate and Final Texts

Receive both intermediate (partial) transcripts as speech is processed and final transcripts when segments are complete.

How It Works

Dictation works in three simple steps:
  1. Create a session: Create a dictation session and get a transcription_session_id
  2. Stream audio: Connect via WebSocket and stream audio data to that session
  3. Receive transcripts: Get transcribed text in real time as you stream
Multiple streams: You can create multiple WebSocket connections to stream audio to the same dictation session. This is useful if you need to:
  • Stream from multiple sources simultaneously
  • Handle reconnections if a WebSocket drops
  • Manage complex audio workflows
All streams use the same transcription_session_id, and transcripts from all streams are combined into one session.
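For example, here is a minimal sketch of two concurrent streams feeding one parent session. It assumes the Python websocket-client package and the non-browser header authentication described in the workflow below; the wss host name is an assumption based on the API host used elsewhere in this guide.
Python
# Sketch: two WebSocket connections streaming into the same dictation session.
# Assumes websocket-client (pip install websocket-client); host name assumed.
import websocket

token = "<sdp_suki_token>"
session_id = "<transcription_session_id>"

def open_stream(token, session_id):
    return websocket.create_connection(
        "wss://sdp.suki.ai/ws/transcribe",  # path per this guide; host assumed
        header={
            "sdp_suki_token": token,
            "transcription_session_id": session_id,
        },
    )

mic_stream = open_stream(token, session_id)  # e.g. provider microphone
aux_stream = open_stream(token, session_id)  # e.g. room or telehealth audio
# Transcripts from both connections are combined into the one session.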

Workflow

How To Use Dictation

Follow these steps to dictate audio:
Step 1: Create Parent Dictation Session

Create a parent session using the POST Create Dictation Session API. This returns a transcription_session_id that you’ll use for all child sessions.
Optional audio configuration: You can customize audio settings when creating the session:
{
  "audio_config": {
    "audio_encoding": "LINEAR16",      // Default: LINEAR16
    "audio_language": "en-US",         // Default: en-US
    "sample_rate_hertz": 16000         // Default: 16000
  }
}
Audio configuration is optional. If not provided, default values are used. Currently, only English (en-US) is supported for dictation.
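As a sketch, the request might look like this in Python. The exact path of the Create Dictation Session endpoint and the response field name are assumptions inferred from the end-session endpoint shown later in this guide; check the API reference for the authoritative values.
Python
import requests

# Hypothetical endpoint path, inferred from the end-session URL below;
# confirm against the Create Dictation Session API reference.
resp = requests.post(
    "https://sdp.suki.ai/api/v1/transcription/session",
    headers={"sdp_suki_token": "<sdp_suki_token>"},
    json={
        "audio_config": {               # optional; defaults shown
            "audio_encoding": "LINEAR16",
            "audio_language": "en-US",  # only en-US is currently supported
            "sample_rate_hertz": 16000,
        }
    },
)
resp.raise_for_status()
transcription_session_id = resp.json()["transcription_session_id"]  # field name assumed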
Step 2: Stream Audio via WebSocket

Connect to the WebSocket endpoint GET /ws/transcribe and stream your audio data. You’ll receive transcribed text in real time as you stream.
Authentication:
  • Browser clients: Use Sec-WebSocket-Protocol header with format: SukiTranscriptionAuth,<sdp_suki_token>,<transcription_session_id>
  • Non-browser clients: Send sdp_suki_token and transcription_session_id as HTTP headers
You can create multiple streaming sessions under the same parent session ID.
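For a non-browser client, a minimal streaming sketch using the Python websocket-client package (the wss host is an assumption; the header names follow the non-browser authentication above):
Python
import websocket  # pip install websocket-client

transcription_session_id = "<transcription_session_id>"  # from step 1

# Non-browser authentication: token and session ID as HTTP headers.
ws = websocket.create_connection(
    "wss://sdp.suki.ai/ws/transcribe",  # path per this guide; host assumed
    header={
        "sdp_suki_token": "<sdp_suki_token>",
        "transcription_session_id": transcription_session_id,
    },
)

# Stream raw LINEAR16 audio in small chunks: 3,200 bytes is 100 ms of
# 16 kHz, 16-bit mono audio.
CHUNK_SIZE = 3200
with open("visit_audio.raw", "rb") as audio:
    while chunk := audio.read(CHUNK_SIZE):
        ws.send_binary(chunk)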
Step 3: Receive Real-time Transcripts

As you stream audio, Suki processes it and returns transcribed text immediately. Transcripts are automatically formatted with punctuation and capitalization.
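This guide does not show the transcript message schema, so the sketch below assumes JSON messages with a transcript string and an is_final flag separating intermediate from final segments; adjust the field names to match the actual payloads.
Python
import json

# Read transcript events from the WebSocket opened in step 2.
# The message shape (transcript / is_final) is an assumption.
while True:
    event = json.loads(ws.recv())
    if event.get("is_final"):
        print("final:  ", event["transcript"])
    else:
        print("partial:", event["transcript"])
    # Break out when your application ends the session.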
Step 4: End Dictation Session

When finished, call the end transcription session API to properly close the session.
cURL
curl -X POST https://sdp.suki.ai/api/v1/transcription/session/{transcription_session_id}/end \
-H "sdp_suki_token: {sdp_suki_token}"

Best Practices

  • Stream in chunks: Send audio data in chunks rather than all at once for better performance
  • Handle errors gracefully: WebSocket connections can drop; implement reconnection logic (see the sketch after this list)
  • End sessions properly: Always call the end session API when finished to free up resources
  • Use appropriate audio settings: Match your audio encoding and sample rate to your source
  • Monitor session state: Track active sessions to prevent resource leaks
  • Test audio quality: Ensure your audio source meets the required specifications for best results
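To illustrate the chunking and reconnection advice above, here is a hedged sketch using the websocket-client package; the host is assumed and error handling is simplified to the library's base exception.
Python
import time
import websocket  # pip install websocket-client

def stream_with_reconnect(chunks, token, session_id, max_retries=5):
    """Send audio chunks, reopening the WebSocket if the connection drops."""
    pending = iter(chunks)
    chunk = next(pending, None)
    retries = 0
    while chunk is not None:
        try:
            ws = websocket.create_connection(
                "wss://sdp.suki.ai/ws/transcribe",  # host assumed
                header={"sdp_suki_token": token,
                        "transcription_session_id": session_id},
            )
            while chunk is not None:
                ws.send_binary(chunk)  # a chunk that failed is resent here
                chunk = next(pending, None)
            ws.close()
        except websocket.WebSocketException:
            retries += 1
            if retries > max_retries:
                raise
            time.sleep(min(2 ** retries, 30))  # exponential backoff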

FAQs

How is Dictation different from Ambient Documentation?

Feature | Description
Dictation | Converts speech to text only; you get the raw transcript (with some formatting such as punctuation and capitalization, and filler words removed)
Ambient Documentation | Converts speech to text AND generates structured clinical notes with LOINC codes and structured data