Quick Summary
Audio Dictation converts spoken conversations into text in real time. You can use this feature to enable transcription of patient-provider conversations.
Audio Dictation is supported via APIs only. SDKs include dictation as part of ambient note generation, but the standalone Audio Dictation feature (speech-to-text only) is available through the APIs.
Overview
Audio Dictation converts spoken conversations into text in real time. Use the Dictation feature to transcribe patient-provider conversations as they happen, so providers can focus on the conversation instead of on taking notes.
Key Features
Real-time Transcription
See text appear as people speak, with no waiting for the conversation to end.
Multiple Sessions
Run multiple dictation sessions under one parent session for complex workflows.
WebSocket Streaming
Low-latency audio streaming for fast, responsive dictation.
Clean Transcripts
Automatically formatted with proper punctuation, capitalization, and filler words removed.
Intermediate and Final Texts
Receive both intermediate (partial) transcripts as speech is processed and final transcripts when segments are complete.
How It Works
Dictation works in three simple steps:
- Create a session: Create a dictation session and get a transcription_session_id
- Stream audio: Connect via WebSocket and stream audio data to that session
- Receive transcripts: Get transcribed text in real time as you stream
You can also run multiple child sessions under one parent, which lets you:
- Stream from multiple sources simultaneously
- Handle reconnections if a WebSocket drops (see the sketch after this list)
- Manage complex audio workflows
All child sessions share the parent's transcription_session_id, and transcripts from all streams are combined into one session.
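Because reconnection is a common need, here is a minimal browser sketch of that pattern, assuming the subprotocol-based authentication described under step 2 below; the host name is a placeholder:

```typescript
// Sketch of the reconnect pattern: when a stream drops, open a new child
// stream against the same parent transcription_session_id.
function connectWithReconnect(
  sdpSukiToken: string,
  transcriptionSessionId: string,
  onOpen: (ws: WebSocket) => void,
): void {
  const ws = new WebSocket(
    "wss://api.example.com/ws/transcribe", // placeholder host
    // The subprotocol list produces the header
    // Sec-WebSocket-Protocol: SukiTranscriptionAuth,<sdp_suki_token>,<transcription_session_id>
    ["SukiTranscriptionAuth", sdpSukiToken, transcriptionSessionId],
  );
  ws.onopen = () => onOpen(ws);
  // Transcripts from each reconnected stream are combined into the same
  // parent session, so no transcript context is lost across drops.
  ws.onclose = () =>
    setTimeout(
      () => connectWithReconnect(sdpSukiToken, transcriptionSessionId, onOpen),
      1000, // fixed retry delay for the sketch; use backoff in production
    );
}
```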
How To Use Dictation
Follow these steps to dictate audio:
1. Create Parent Dictation Session
Create a parent session using the POST Create Dictation Session API. This returns a transcription_session_id that you'll use for all child sessions.
Optional audio configuration: You can customize audio settings when creating the session. If no configuration is provided, default values are used. Currently, only English (en-US) is supported for dictation.
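A minimal sketch of this call in TypeScript, assuming a JSON request body; the host, path, and audio_config shape below are placeholders, so consult the Create Dictation Session API reference for the actual contract:

```typescript
// Sketch only: host, path, and body fields are assumptions, not the real API.
const BASE_URL = "https://api.example.com"; // placeholder host

interface CreateSessionResponse {
  transcription_session_id: string; // id reused by all child sessions
}

async function createDictationSession(sdpSukiToken: string): Promise<string> {
  const res = await fetch(`${BASE_URL}/v1/dictation-sessions`, { // placeholder path
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      sdp_suki_token: sdpSukiToken,
    },
    // Audio configuration is optional; defaults apply when omitted.
    body: JSON.stringify({ audio_config: { language: "en-US" } }), // assumed shape
  });
  if (!res.ok) throw new Error(`Create session failed: ${res.status}`);
  const data = (await res.json()) as CreateSessionResponse;
  return data.transcription_session_id;
}
```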
2. Stream Audio via WebSocket
Connect to the WebSocket endpoint GET /ws/transcribe and stream your audio data. You'll receive transcribed text in real time as you stream.
Authentication:
- Browser clients: Use the Sec-WebSocket-Protocol header with the format SukiTranscriptionAuth,<sdp_suki_token>,<transcription_session_id>
- Non-browser clients: Send sdp_suki_token and transcription_session_id as HTTP headers
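For browser clients, passing a list of subprotocols to the WebSocket constructor is what produces the Sec-WebSocket-Protocol header described above. A sketch, with a placeholder host and an assumed microphone capture via MediaRecorder (the audio encoding the service expects is not specified here, so verify it against the audio configuration options):

```typescript
// Open the dictation stream; the subprotocol list yields the header
// Sec-WebSocket-Protocol: SukiTranscriptionAuth,<token>,<session_id>
function openDictationStream(
  sdpSukiToken: string,
  transcriptionSessionId: string,
): WebSocket {
  const ws = new WebSocket(
    "wss://api.example.com/ws/transcribe", // placeholder host
    ["SukiTranscriptionAuth", sdpSukiToken, transcriptionSessionId],
  );
  ws.binaryType = "arraybuffer";
  return ws;
}

// Capture microphone audio and forward chunks over the socket. MediaRecorder
// emits compressed audio (e.g. webm/opus); confirm the format the service
// accepts before relying on this.
async function streamMicrophone(ws: WebSocket): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  recorder.ondataavailable = async (event) => {
    if (ws.readyState === WebSocket.OPEN && event.data.size > 0) {
      ws.send(await event.data.arrayBuffer());
    }
  };
  recorder.start(250); // emit a chunk roughly every 250 ms
}
```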
3. Receive Real-time Transcripts
As you stream audio, Suki processes it and returns transcribed text immediately. Transcripts are automatically formatted with punctuation and capitalization.
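A sketch of consuming those messages; the JSON schema below (text and is_final fields) is an assumption for illustration, so check the Stream Audio API reference for the actual message format:

```typescript
// Assumed message shape: the real field names may differ.
interface TranscriptMessage {
  text: string;
  is_final: boolean; // distinguishes final segments from intermediate ones
}

function handleTranscripts(ws: WebSocket): void {
  ws.onmessage = (event: MessageEvent) => {
    const msg = JSON.parse(event.data as string) as TranscriptMessage;
    if (msg.is_final) {
      console.log("final:", msg.text); // stable text; safe to persist
    } else {
      console.log("partial:", msg.text); // may be revised; display only
    }
  };
}
```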
4. End Dictation Session
When finished, call the end transcription session API to properly close the session.
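A sketch of that call; the host, path, and method are placeholders, so see the End Session API reference for the real endpoint:

```typescript
// Sketch only: endpoint details are assumptions, not the documented API.
async function endDictationSession(
  sdpSukiToken: string,
  transcriptionSessionId: string,
): Promise<void> {
  const res = await fetch(
    `https://api.example.com/v1/dictation-sessions/${transcriptionSessionId}/end`, // placeholder
    { method: "POST", headers: { sdp_suki_token: sdpSukiToken } },
  );
  if (!res.ok) throw new Error(`End session failed: ${res.status}`);
}
```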
Related APIs
Create Transcription Session
Create a parent transcription session
Stream Audio
Stream audio for real-time transcription
End Session
End a transcription session
FAQs
What's the difference from Ambient Documentation?
| Feature | Description |
|---|---|
| Dictation | Converts speech to text only; you get the transcript with light formatting (punctuation, capitalization, and filler-word removal) |
| Ambient Documentation | Converts speech to text AND generates structured clinical notes with LOINC codes and structured data |