Dictation API

Use the dictation APIs when you need speech-to-text without the full ambient clinical note flow. You create a transcription session, open a WebSocket to stream audio, then end the session when capture is finished so resources can close cleanly.

Usage scenarios

You need speech-to-text without the full ambient clinical note flow.
You want to manage the session lifecycle yourself: create a transcription session, stream audio over a WebSocket, and end the session when capture is finished so resources can close cleanly.

How it works

The dictation APIs work in four steps:

Authenticate

Create a transcription session

Call Create dictation session and save the transcription_session_id it returns. The later steps refer to the session by this ID.

Open a WebSocket to stream audio

Connect to Stream audio to dictation session when the session is READY or IDLE. Refer to Dictation streaming for outbound audio messages and inbound transcript frames.

End the session when capture is finished

Call End dictation session to end the active session and retrieve the final transcription results.
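
The steps above can be sketched as a small lifecycle helper. This is a minimal sketch, not the official client: the DictationClient interface, its method names, and the status field on the create response are all assumptions made for illustration. Only transcription_session_id and the READY and IDLE states come from this page. Authentication is assumed to have happened before the client is passed in.

```python
from typing import Iterable, Protocol

# States in which the WebSocket may be opened, per this page.
STREAMABLE_STATUSES = {"READY", "IDLE"}


class DictationClient(Protocol):
    """Hypothetical, already-authenticated client; method names are assumptions."""

    def create_session(self) -> dict: ...
    def stream_audio(self, session_id: str, chunks: Iterable[bytes]) -> None: ...
    def end_session(self, session_id: str) -> dict: ...


def can_open_stream(status: str) -> bool:
    """Return True when the session state allows opening the WebSocket."""
    return status in STREAMABLE_STATUSES


def run_dictation(client: DictationClient, chunks: Iterable[bytes]) -> dict:
    """Create a session, stream audio, and always end the session afterwards."""
    session = client.create_session()
    session_id = session["transcription_session_id"]
    try:
        # "status" is an assumed field name on the create response.
        if not can_open_stream(session.get("status", "")):
            raise RuntimeError("session is not READY or IDLE")
        client.stream_audio(session_id, chunks)
    except Exception:
        # End the session even on failure so resources can close cleanly.
        client.end_session(session_id)
        raise
    # The end call returns the final transcription results.
    return client.end_session(session_id)
```

Ending the session on both the success and failure paths mirrors the guidance to end the session when capture is finished so resources can close cleanly.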

Endpoints

Create Dictation Session

Creates a new transcription session for real-time audio transcription.

Stream Audio To Dictation Session

Establishes a WebSocket connection to the transcription service for real-time audio streaming and transcription. Open the connection only when the dictation session is READY or IDLE. Inbound frames use TranscriptionStreamResponse: partial and final transcripts, followed by a terminal EOF frame.
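
To illustrate the inbound side, the sketch below consumes a sequence of text frames until the terminal EOF frame. The TranscriptionStreamResponse schema is not shown on this page, so the JSON encoding, the type and text field names, and the "partial", "final", and "eof" values are assumptions. Check them against the Dictation streaming reference before relying on them.

```python
import json
from typing import Iterable, List, Tuple


def collect_transcript(frames: Iterable[str]) -> Tuple[str, List[str]]:
    """Return (joined final text, partial texts seen), stopping at the EOF frame.

    Assumes each frame is a JSON object with a "type" discriminator and a
    "text" field; both names are hypothetical.
    """
    finals: List[str] = []
    partials: List[str] = []
    for raw in frames:
        frame = json.loads(raw)
        kind = frame.get("type")
        if kind == "eof":
            # Terminal frame: nothing after it is processed.
            break
        if kind == "final":
            finals.append(frame.get("text", ""))
        elif kind == "partial":
            # Partials are interim results; keep them separate from finals.
            partials.append(frame.get("text", ""))
    return " ".join(finals), partials
```

In a live client, the frames argument would be the messages received on the WebSocket. Here it is any iterable of strings, so the handling logic can be exercised without a connection.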

End Dictation Session

Ends an active transcription session and returns the final transcription results.

Last modified on May 22, 2026
