Quick Summary
Audio Dictation converts spoken conversations into text in real time. You can use this feature to enable transcription of patient-provider conversations.
Audio Dictation is supported for APIs only. SDKs include dictation as part of ambient note generation, but the standalone Audio Dictation feature (speech-to-text only) is available via APIs only.

Overview

Audio Dictation converts spoken conversations into text in real time. Use the Dictation feature to transcribe patient-provider conversations as they happen, which lets providers focus on the conversation instead of on transcribing it.

Key Features

Real-time Transcription

See text appear as people speak; no waiting for the conversation to end.

Multiple Sessions

Run multiple dictation sessions under one parent session for complex workflows.

WebSocket Streaming

Low-latency audio streaming for fast, responsive dictation.

Clean Transcripts

Automatically formatted with proper punctuation, capitalization, and filler words removed.

Intermediate and Final Texts

Receive both intermediate (partial) transcripts as speech is processed and final transcripts when segments are complete.

How It Works

Dictation works in three simple steps:
  1. Create a session: Create a dictation session and get a transcription_session_id
  2. Stream audio: Connect via WebSocket and stream audio data to that session
  3. Receive transcripts: Get transcribed text in real time as you stream
Multiple streams: You can create multiple WebSocket connections to stream audio to the same dictation session. This is useful if you need to:
  • Stream from multiple sources simultaneously
  • Handle reconnections if a WebSocket drops
  • Manage complex audio workflows
All streams use the same transcription_session_id, and transcripts from all streams are combined into one session.
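For example, here is a minimal sketch of two concurrent streams feeding one parent session. It assumes the Python websocket-client package and the non-browser header authentication described in the workflow below; the wss host name is an assumption based on the API host used elsewhere in this guide.
Python
# Sketch: two WebSocket connections streaming into the same dictation session.
# Assumes websocket-client (pip install websocket-client); host name assumed.
import websocket

token = "<sdp_suki_token>"
session_id = "<transcription_session_id>"

def open_stream(token, session_id):
    return websocket.create_connection(
        "wss://sdp.suki.ai/ws/transcribe",  # path per this guide; host assumed
        header={
            "sdp_suki_token": token,
            "transcription_session_id": session_id,
        },
    )

mic_stream = open_stream(token, session_id)  # e.g. provider microphone
aux_stream = open_stream(token, session_id)  # e.g. room or telehealth audio
# Transcripts from both connections are combined into the one session.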

Workflow

How To Use Dictation

Follow these steps to dictate audio:
Step 1: Create Parent Dictation Session

Create a parent session using the POST Create Dictation Session API. This returns a transcription_session_id that you’ll use for all child sessions.
Optional audio configuration: You can customize audio settings when creating the session:
{
  "audio_config": {
    "audio_encoding": "LINEAR16",      // Default: LINEAR16
    "audio_language": "en-US",         // Default: en-US
    "sample_rate_hertz": 16000         // Default: 16000
  }
}
Audio configuration is optional. If not provided, default values are used. Currently, only English (en-US) is supported for dictation.
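As a sketch, the request might look like this in Python. The exact path of the Create Dictation Session endpoint and the response field name are assumptions inferred from the end-session endpoint shown later in this guide; check the API reference for the authoritative values.
Python
import requests

# Hypothetical endpoint path, inferred from the end-session URL below;
# confirm against the Create Dictation Session API reference.
resp = requests.post(
    "https://sdp.suki.ai/api/v1/transcription/session",
    headers={"sdp_suki_token": "<sdp_suki_token>"},
    json={
        "audio_config": {               # optional; defaults shown
            "audio_encoding": "LINEAR16",
            "audio_language": "en-US",  # only en-US is currently supported
            "sample_rate_hertz": 16000,
        }
    },
)
resp.raise_for_status()
transcription_session_id = resp.json()["transcription_session_id"]  # field name assumed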
Step 2: Stream Audio via WebSocket

Connect to the WebSocket endpoint GET /ws/transcribe and stream your audio data. You’ll receive transcribed text in real time as you stream.
Authentication:
  • Browser clients: Use Sec-WebSocket-Protocol header with format: SukiTranscriptionAuth,<sdp_suki_token>,<transcription_session_id>
  • Non-browser clients: Send sdp_suki_token and transcription_session_id as HTTP headers
You can create multiple streaming sessions under the same parent session ID.
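For a non-browser client, a minimal streaming sketch using the Python websocket-client package (the wss host is an assumption; the header names follow the non-browser authentication above):
Python
import websocket  # pip install websocket-client

transcription_session_id = "<transcription_session_id>"  # from step 1

# Non-browser authentication: token and session ID as HTTP headers.
ws = websocket.create_connection(
    "wss://sdp.suki.ai/ws/transcribe",  # path per this guide; host assumed
    header={
        "sdp_suki_token": "<sdp_suki_token>",
        "transcription_session_id": transcription_session_id,
    },
)

# Stream raw LINEAR16 audio in small chunks: 3,200 bytes is 100 ms of
# 16 kHz, 16-bit mono audio.
CHUNK_SIZE = 3200
with open("visit_audio.raw", "rb") as audio:
    while chunk := audio.read(CHUNK_SIZE):
        ws.send_binary(chunk)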
Step 3: Receive Real-time Transcripts

As you stream audio, Suki processes it and returns transcribed text immediately. Transcripts are automatically formatted with punctuation and capitalization.
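This guide does not show the transcript message schema, so the sketch below assumes JSON messages with a transcript string and an is_final flag separating intermediate from final segments; adjust the field names to match the actual payloads.
Python
import json

# Read transcript events from the WebSocket opened in step 2.
# The message shape (transcript / is_final) is an assumption.
while True:
    event = json.loads(ws.recv())
    if event.get("is_final"):
        print("final:  ", event["transcript"])
    else:
        print("partial:", event["transcript"])
    # Break out when your application ends the session.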
Step 4: End Dictation Session

When finished, call the end transcription session API to properly close the session.
cURL
curl -X POST https://sdp.suki.ai/api/v1/transcription/session/{transcription_session_id}/end \
-H "sdp_suki_token: {sdp_suki_token}"

Best Practices

  • Stream in chunks: Send audio data in chunks rather than all at once for better performance
  • Handle errors gracefully: WebSocket connections can drop; implement reconnection logic (see the sketch after this list)
  • End sessions properly: Always call the end session API when finished to free up resources
  • Use appropriate audio settings: Match your audio encoding and sample rate to your source
  • Monitor session state: Track active sessions to prevent resource leaks
  • Test audio quality: Ensure your audio source meets the required specifications for best results
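To illustrate the chunking and reconnection advice above, here is a hedged sketch using the websocket-client package; the host is assumed and error handling is simplified to the library's base exception.
Python
import time
import websocket  # pip install websocket-client

def stream_with_reconnect(chunks, token, session_id, max_retries=5):
    """Send audio chunks, reopening the WebSocket if the connection drops."""
    pending = iter(chunks)
    chunk = next(pending, None)
    retries = 0
    while chunk is not None:
        try:
            ws = websocket.create_connection(
                "wss://sdp.suki.ai/ws/transcribe",  # host assumed
                header={"sdp_suki_token": token,
                        "transcription_session_id": session_id},
            )
            while chunk is not None:
                ws.send_binary(chunk)  # a chunk that failed is resent here
                chunk = next(pending, None)
            ws.close()
        except websocket.WebSocketException:
            retries += 1
            if retries > max_retries:
                raise
            time.sleep(min(2 ** retries, 30))  # exponential backoff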

FAQs

How is Dictation different from Ambient Documentation?

Feature | Description
Dictation | Converts speech to text only; you get the raw transcript (with some formatting such as punctuation and capitalization, and filler words removed)
Ambient Documentation | Converts speech to text AND generates structured clinical notes with LOINC codes and structured data